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Abstract 


Wide-area indoor people detection in a network of depth sensors is the 
basis for many applications, e. g. people counting, customer behavior anal- 
ysis, public security or ambient assisted living. In contrast to classical 
pedestrian detection approaches, depth sensors typically capture the scene 
from the top-view to minimize occlusions in crowded scenes. As a con- 
sequence of the vertical top-view, position changes of individuals lead 
to drastically varying appearances. This makes the people detection task 
quite challenging for off-the-shelf, data-driven pedestrian detectors. More- 
over, people are occasionally only partially visible in a single view due to 
occlusion, measurement noise or the limited field of view of a top-view 
depth sensor. 

Considering the aforementioned challenges, the present thesis examines 
how to exploit the full temporal multi-view image evidence. In particu- 
lar, we investigate how the redundant and complementary multi-view 
information, including the temporal context, can be jointly leveraged to 
improve the detection performance. We recast the problem of multi-view 
people detection in a sequence of overlapping depth images as an inverse 
problem and present a generative probabilistic framework to jointly ex- 
ploit the temporal multi-view image evidence. Based on the proposed 
probabilistic model, we examine several inference methods, including con- 
tinuous gradient based optimization, variational inference and end-to-end 
deep learning methods. As our main contribution, we propose to use 
mean-field variational inference to not only estimate the maximum a pos- 
teriori (MAP) state, but to also approximate the joint posterior probability 
distribution of people present in the scene across space and time. 


Abstract 


For evaluation, we introduce a novel data set for indoor people detection 
in multiple overlapping top-view depth images. We report state-of-the 
art results for the proposed mean-field variational inference methods. 
Furthermore, we demonstrate that, compared to the frame-by-frame mono- 
view setup, our approach successfully exploits the temporal multi-view 
image evidence and robustly converges in only a few iterations. 
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1 Introduction 


By virtue of the emergence of low-cost commodity depth sensors', there 
is an increasing demand for privacy-preserving wide-area people detec- 
tion in various indoor scenarios, e. g. people counting, customer behavior 
analysis, public security, ambient assisted living or smart homes. In con- 
trast to the classical outdoor video surveillance scenario, the mounting 
height is very limited in many indoor scenarios. Therefore, the depth 
sensors typically capture the scene from the top-view to reduce occlusions 
in crowded scenes. However, as a direct consequence of the top-view and 
the limited mounting height, the resulting field of view and observable 
area of a single depth sensor is quite limited. This is an issue in many 
real-world applications such as customer behavior analysis in shopping 
malls or airports. Providing complete detection in a wide-area scenario is 
achieved by employing a network of multiple sensors, covering a larger 
area. 

Apart from the increased observable area, there are additional advan- 
tages compared to the classical single-view approach. Since a single view 
does not capture all the details in a 3D scene, considering additional par- 
tially overlapping views provides more information about the true scene 
state. This is especially relevant in situations where people are only par- 
tially visible in one camera view due to occlusion or the limited field 
of view (see Fig. 1.1). Hence, the detection performance (including the 
detection confidence) in the overlapping regions can be improved by com- 
plementary image evidence from multiple views. In particular, this is 


PA depth sensor refers to a range imaging sensor, which measures the distance to points in 


the scene. 
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(d) Observation 1 (01) (e) Observation 2 (05) (f) Observation 3 (03) 


Figure 1.1 Example observations from our multi-view setup with six people present in the 
covered area, marked with unique colors. In view (a) the magenta marked person occludes 
the person marked with green, while in (b) the pair can be clearly separated. In the views 
(b,c) several people are only partially visible. Note that for inference only the depth images 
(d-f) are used. 


relevant in demanding applications such as emergency detection in an 
ambient assisted living context. 

The general problem of people detection in a multi-camera setup has 
been widely studied in the computer vision literature. Nonetheless, the 
vast majority of existing multi-view approaches use monocular video 
cameras and focus on pedestrian detection in outdoor scenarios, capturing 
the pedestrians from profile or frontal view. In contrast, in this thesis we 
focus on the task of top-view people detection in a network of low-cost 
commodity depth sensors. 


: Unambiguously, the present work is a contribution from the sole author. However, for the 
ease of an active voice writing style, in this thesis the first-person plural is used instead of 
the first-person singular. Depending on the context, "we" refers to the sole author or the 
reader and the author, respectively. 


1.1 Problem Statement 


1.1 Problem Statement 


In this thesis we focus on indoor people detection in a sequence of con- 
secutive multi-view frames, where each multi-view frame is given by 
multiple overlapping depth images. The depth images are obtained by a 
network of low-resolution commodity depth sensors. The sensors have a 
top-view on the scene with fields of view having a significant joint overlap. 
Fig. 1.1 shows one exemplary multi-view frame obtained by our setup. The 
grayscale images in the top-row (Fig. 1.1 (a-c)) only serve for the purpose 
of visualization. For inference, we use the foreground-segmented depth 
observations in Fig. 1.1 (d-f). The major challenges are summarized as 
follows: 


« Due to the passive? low-resolution stereo sensors, the depth obser- 
vations suffer from heavy measurement noise, including areas with 
no measurements. 


= Asa consequence of the vertical top-view, position changes of indi- 
viduals lead to drastically varying appearances (see Fig. 1.1). This 
makes the single view detection task challenging for discrimina- 
tive detection algorithms, especially for off-the-shelf data-driven 
pedestrian detectors without a domain-specific large scale data set. 


= Frequently, people are only partially visible in one camera view 
due to occlusion, the limited field of view or measurement noise. 
However, a typical detector operates independently on a single view 
and thereby does not make use of the given complementary image 
evidence. The challenge is to leverage the full multi-view image 
evidence, including multiple overlapping regions, in order to gain 
more information about the true state of the scene. 


= The depth sensors provide a pseudo synchronized sequence of con- 
secutive frames with approximately 15 frames per second. It is 


3 In contrast to active depth sensors (e. g. time-of-flight or structured light based devices), 


passive stereo sensors do not emit any signal. 
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desirable to additionally make use of this temporal information in 
order to improve the overall detection performance. 


To overcome these challenges, the overarching idea followed in this 
thesis is to jointly make use of the temporal multi-view image evidence. 
We aim for methods which are able to make use of the given redundant and 
complementary image evidence, including the temporal context. At the 
same time we want to minimize the loss of information typically occurring 
in a cascade of independently applied algorithms, each operating on 
lossy representations of the output of its predecessor (e.g. a tracking- 
by-detection pipeline applied to each single view independently). In terms 
of probability theory, we are interested in methods which approximate the 
joint probability distribution of people present in a scene across space and 


time. 


1.2 System Overview 


To address the challenges identified in Sect. 1.1 we introduce a novel prob- 
abilistic framework to approximate the distribution of people present in 
the scene across space and time. The proposed approach is schematically 
illustrated in Fig. 1.2. In contrast to the majority of existing multi-view 
people detection approaches in the literature, we do not rely on locally 
applied discriminative pedestrian detectors requiring a large scale labeled 
data set. Instead, we formulate the problem of people detection in mul- 
tiple overlapping depth images as an inverse problem. Therefore, we 
introduce a generative scene model, which maps a scene state (in the 
following referred to as scene configuration) to a synthetic depth image 
in the perspective of each sensor (see Fig. 1.2 top center). The generative 
approach allows to effectively handle the different appearances’ of people, 
arising from (i) the change of viewpoint and (ii) the partial visibility of 


* In the computer vision literature appearance typically refers to the texture and color of an 


object. In contrast, in this context it refers to the appearance of an individual in a depth 
image. 


1.2 System Overview 


Gr(x+) G(x) Gs (x+) 


Sequence of multi-view frames 


. 0.4 06 
„sensar3 (osr) 1 | Sensor 3 (oar) 1 Probability 


Figure 1.2 Overview of the proposed approach. A time sequence of foreground segmented 
multi-view depth images from three sensors is used as input (left). The generative scene 
model generates synthetic depth images with respect to the given intrinsic and extrinsic 
sensor parameters (middle). The output of the stochastic inference are discrete probability 
maps representing the probability of people present on the ground level plane across time 
and space (right). 


people, e.g. due to occlusion or the limited field of view. This yields a 
viewpoint independent detector without the need of a training data set. 
The generative model is used in a probabilistic framework which leverages 
the full multi-view information given in the overlapping image regions 
for joint probabilistic people detection. 

Since in general we have access to a sequence of consecutive temporal 
frames, we extend this framework to additionally consider the temporal 
context. A common way to leverage temporal information is to use the 
detections obtained from an isolated multi-view frame as input for an off- 
the-shelf tracking-by-detection approach to get smooth person trajectories. 
Nonetheless, those methods do not take advantage of the full temporal in- 
formation since the tracking component operates on a lossy representation 
of object detections and does not have access to the joint distribution of 
objects in the scene. In contrast, our goal is to avoid the loss of information 
by taking into account the temporal image evidence from all sensor-views. 
Therefore, we introduce a probabilistic dynamics model to express the 
probability flow over time. In consequence, we are able to define the full 
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joint distribution of people present in the scene across all sensor views and 
time steps. 

For inference, we propose two different methods. First, we present a 
maximum a posteriori (MAP) method for people detection. Inference is ob- 
tained by a continuous optimization method, which hinges on a good ini- 
tialization. The MAP approach reveals the shortcomings of point estimates 
in comparison to methods that explore the full posterior distribution of 
people present in the scene. Second, as our main contribution we present a 
variational approach. Discretization of the ground level plane (see Fig. 1.2 
right) enables an effective estimate of the posterior distribution. Instead of 
just estimating an MAP point estimate, we apply mean-field variational in- 
ference to approximate the desired joint probability distribution of people 
present in the scene. 


1.3 Contribution 


The idea of generative probabilistic modeling has been successfully ap- 
plied to the task of multi-view people detection with monocular video 
cameras by Fleuret et al. [38]. However, to the best of our knowledge, 
probabilistic modeling in combination with a generative scene model has 
not yet been studied in the context of people detection in overlapping 
depth images. In particular, the scientific contribution of the present thesis 


can be summarized as follows: 


= We introduce a probabilistic model for multi-view people detection 
in overlapping depth images, including a generative scene model 
based on an efficient 3D person shape model. 


= For continuous inference we deduce the MAP objective and show 
how to practically solve the resulting non-linear least squares opti- 
mization problem by leveraging approximate differentiable render- 
ing [68]. 


1.3 Contribution 


= Our main contribution introduces mean-field variational inference 
to approximate the posterior distribution of people present in the 
scene. In contrast to Fleuret et al. [38] we use depth images as 
evidence and therefore are able to employ a richer generative scene 
model. We also propose a novel strategy to approximate the final 
mean-field update expectation by making use of geometric scene 
knowledge and a pre-computed visual dictionary. 


= We propose a novel extension to incorporate the temporal context 
into the mean-field optimization. To that end, we (i) present a prob- 
abilistic grid-based dynamics model to define the joint distribution 
across space and time; and (ii) deduce the mean-field variational 
inference update equations to efficiently approximate the desired 
probability distributions. 


= For comparison with our probabilistic inference methods, we in- 
troduce an end-to-end multi-view CNN architecture. The CNN 
architecture is only trained with synthetic depth images due to fair 
comparison with the proposed generative methods. 


= To the best of our knowledge, no publicly available data set covering 
the scenario of top-view people detection in a depth sensor network 
currently exists. We introduce a novel data set for indoor people 
detection in multiple overlapping top-view depth images. 


= For evaluation, we compare our approach with state-of-the-art 
monocular multi-view people detection methods. We report state- 
of-the-art results for the proposed mean-field variational inference 
methods on the aforementioned data set. Furthermore, we demon- 
strate that our approach (compared to the mono-view setup) success- 
fully exploits the multi-view image evidence and robustly converges 


in only a few iterations. 


1 Introduction 


1.4 Thesis Outline 


The present thesis is structured as follows: Chapter 2 provides an overview 
of the most relevant literature related to the general task of multi-view 
people detection. In Chapter 3 two essential fundamentals for the present 
work are explained. First, the method of mean-field variational inference 
(Sect. 3.1) is elaborated. Building on that, the probabilistic occupancy map, 
which is methodically strongly related to the present work, is explained 
in detail (Sect. 3.2). In Chapter 4 the probabilistic model, including a gen- 
eral part (Sect. 4.2), as well as manifestations for continuous latent space 
(Sect. 4.3) and discrete latent space (Sect. 4.4), is introduced. Subsequently, 
in Chapter 5 different inference methods regarding the proposed proba- 
bilistic model are discussed. As introduction, the MAP objective for the 
continuous latent space is deduced (Sect. 5.1). Practical aspects regarding 
the resulting continuous non-linear least squares optimization problem 
are discussed in Sect. 5.1.2. In Sect. 5.2 the main contribution of this the- 
sis is presented. The mean-field update equations are deduced and it is 
shown how they can be approximated in real-world applications. Com- 
plementary to the probabilistic inference methods an end-to-end CNN 
inference method is presented in Sect. 5.3. In Chapter 6 the results of our 
experiments are discussed. This includes the introduction of a novel data 
set (Sect. 6.1), comparative evaluation of the proposed probabilistic people 
detection approach in discrete latent space (Sect. 6.4), as well as qualitative 
results for MAP inference in continuous latent space (Sect. 6.6). Finally, 
Chapter 7 concludes the thesis and provides an outlook on future research 


directions. 


2 Related Work 


Multi-camera people detection has been extensively studied in the context 
of video surveillance. The vast majority of the existing approaches is 
based on multiple monocular video cameras observing an outdoor scene. 
However, the topic of indoor people detection in multiple depth images, 
especially from the top-view, has not yet been explored in detail. Therefore, 
we discuss the related task of people detection in multiple monocular 
video cameras. Since many multi-view methods accomplish detection and 
tracking by fusing local detections or local tracklets into a common world 
coordinate system, we will also briefly focus on related single-view people 
detection approaches in Sect. 2.1. 

Besides the major categories (single and multi-view), we categorize 
the literature in methods utilizing monocular video cameras (RGB-based 
approaches, or intensity based approaches) and depth sensing cameras 
(depth-based approaches). Since many depth sensors also provide RGB 
information, many hybrid approaches exist (RGB-D), which will be cate- 
gorized under the depth-based approaches. 

Following the aforementioned categorization of RGB-based and depth- 
based approaches, we will discuss multi-view approaches in Sect. 2.2. In 
Sect. 2.2.1 we comprehensively discuss methods focusing on pedestrian 
detection with multiple monocular cameras, and their relation to our 
approach. In Sect. 2.2.2 we focus on multi-view people detection methods 
based on depth information. Finally, we reveal the contribution in relation 
to the existing literature in Sect. 2.3. 

The characteristics of the approaches examined in this chapter are sum- 
marized in Tab. 2.1. The binary attribute top-view indicates if the corre- 
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sponding method is evaluated on top-view images in the original publica- 
tion. The attribute generative indicates if some kind of generative modeling 
is used for the detection component. For the multi-view approaches the 
additional binary attribute joint detection indicates that the multi-view 
detections are obtained jointly, rather than performing detection on each 
view independently. This chapter is an extension of previously published 
work [108, Ch. 2] by the author. 


2.1 Single-View People Detection 


The problem of pedestrian detection and tracking from the frontal or 
profile view has been examined in detail in the literature. In this work, 
however, we focus in particular on sensors having a top-view (also re- 
ferred to as overhead view in the literature) on the scene. As set forth in 
Sect. 1.1, people detection from the vertical top-view implies new chal- 
lenges (e. g. the drastically varying appearance of individuals depending 
on the view-point) compared to the frontal and profile view. While the de- 
tection from the top-view becomes increasingly important in the computer 
vision community, it still remains a niche research area. In the following 
we will therefore focus in particular, but not exclusively, on single-view 
people detection from the top-view. For a comprehensive survey of people 
detection methods in particular specialized on top-view approaches we 
refer to [3]. 


2.1.1 RGB-Based Approaches 


The problem of people detection and tracking in a single RGB image 
has been extensively investigated [54, 62, 66, 73, 104]. However, in this 
section we will in particular focus on the rather rare approaches which are 
addressing the problem of people detection from the top-view ina single 
RGB image. 
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Table 2.1 Characterization of related single and multi-view people detection approaches. 
The column top-view indicates if a method is in particular evaluated with views from the 
vertical top-view in the original paper. Parentheses indicate that an attribute cannot be 
assigned unambiguously. 


Top Gener Joint Method 

Modality -view -ative detection keyword 
Single-View 
Ahmed et al. [4] RGB "4 x - HOG +SVM 
Ahmad et al. [1] RGB Y x - CNN 
Ahmad et al. [2] RGB Y x - CNN 
Ahmed et al. [5] RGB Y x - CNN 
Bagautdinov et al. [8] Depth X Y - Probab. (MF-VI) 
Hacinecipoglu et al. [45] Depth (Vv) x - VHF + SVM 
Liu et al. [67] Depth x x - CNN 
Tian et al. [95] Depth x x - CNN 
Li et al. [61] RGB-D x x - CNN 
Rauter [80] Depth Y x - SVM 
Ertler et al. [33] RGB-D Y x - CNN 
Fuentes-Jimenez et al. [41] Depth Y x - CNN 
Sun et al. [92] RGB-D Y x - Blob detection 
Carletti et al. [19] Depth "4 X - Blob detection 
Multi-View 
Xu et al. [101] RGB x x Y Probabilistic 
Sankaranarayanan et al. [88] RGB x x x Homography 
Khan et al. [53] RGB x x Y Homography 
Eshel et al. [34] RGB (v) x Y Homography 
Peng et al. [75] RGB x x (v) Probabilistic 
Fleuret et al. [39, 38] RGB x Y Y Probab. (MF-VI) 
Alahi et al. [6] RGB x Y Y Sparse opt. 
Baqué et al. [10] RGB x (v) Y CNN + CRF 
Chavdarova et al. [23] RGB x (v) Y CNN 
Hou et al. [47] RGB x x Y CNN 
Tang et al. [93] RGB x x x CNN 3D pose 
Chen et al. [24] RGB x x x CNN 
You et al. [102] RGB x x Y CNN 
Ge et al. [42] RGB x Y Y Probab. (MCMC) 
Zhang et al. [103] RGB x x Y CNN 
Castellano et al. [21] RGB (v) x Y CNN 
Saputra et al. [89] RGB-D x x x Kinect skeleton 
Sun et al. [91] RGB-D x x x Kinect skeleton 
Munaro et al. [69] RGB-D x x x HOG 
Carraro et al. [20] RGB-D x x x CNN 3D pose 
Tseng et al. [96] Depth v X Y 3D image stitching 
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Ahmed et al. [4] propose a geometric normalization with respect to the 
optical center in the image to standardize the appearance of individuals. 
For detection the authors follow the fundamental ideas of [27] and train 
a Support Vector Machine (SVM) with histogram of oriented gradients 
(HOG) features. 

More recent methods incorporate deep learning frameworks to accom- 
plish people detection from the top-view. In [1] the pre-trained Single 
Shot Multi Box Detector [65] is applied to top-view people detection in a 
single RGB image. Ahmad et al. [2] follow a similar idea and apply the 
pre-trained YOLOv3 Convolutional Neural Network (CNN) [81, 82] to the 
task of top-view people detection. In [5] these ideas are further developed. 
The authors introduce a top-view people detection and tracking method 
based on recent deep learning architectures. For detection the YOLOv3 
architecture is re-trained with top-view images. The tracking is based on 
Simple Online and Realtime Tracking (SORT) [100]. 


2.1.2 Depth-based Approaches 


We will first discuss depth-based methods for people detection in the 
general frontal or profile view. Bagautdinov et al. [8] introduce DPOM, a 
probabilistic occupancy map for occluded single depth images (cf. Fleuret 
et al. [38]). The authors propose a probabilistic generative model to define 
the expected distribution of depth values with respect to the presence of 
an individual. For inference a mean-field variational inference strategy 
is proposed. Hacinecipoglu et al. [45] introduce a pose invariant peo- 
ple detection approach, based on point cloud data. The point cloud is 
first clustered and head candidates are extracted. For classification an 
SVM with a Viewpoint Feature Histogram (VHF) [86] feature descriptor is 
trained. Recently CNN architectures [61, 67, 95] are successfully applied to 
single view depth image people detection, leveraging data sets with many 
labeled images for training. However, those mentioned approaches are 
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trained with data sets mostly containing people from the classical frontal 
or profile view but not from the top-view. 

As for monocular RGB approaches, another class of methods focuses 
on top-view people detection. Rauter et al. [80] introduce a novel feature 
descriptor based on local depth differences for top-view people detection. 
The feature descriptor is used to train an SVM for head-shoulder detection. 
To reduce the computational complexity, the resulting detector is only 
applied to potential head candidates, determined by a local maxima search 
in the depth image. More recent work by Ertler et al. [33] fuses depth 
and RGB data by combining two CNN streams in a Faster R-CNN [83] 
architecture. The authors propose a late and mid-layer fusion and report 
comparative evaluation results for the different fusion methods. Recently, 
Fuentes-Jimenez et al. [41] introduce the CNN architecture DPDnet. The 
proposed encoder-decoder architecture takes a depth image as input and 
predicts a confidence map (in image coordinates), where each detection is 
represented by a 2D normal distribution. 

The related problem of people counting with a single depth camera from 
the top-view has been studied in great detail [19, 30, 92, 105]. In contrast 
to our proposed method, those approaches focus on integrated systems 
counting the number of persons crossing a certain virtual line, providing 
people detection implicitly and in a rather small area. 

For a comprehensive survey of people tracking in a single RGB-D view 
(including, but not limited to the top-view) we refer to Camplani et al. [17]. 
Complementary, we refer to Liciotti et al. [63] for a narrower review of 
top-view people detection and counting with RGB-D sensors. 


2.2 Multi-View People Detection 


2.2.1 RGB-Based Approaches 


In this section we focus on people detection with multiple monocular 
cameras. In order to restrict our scope, we do not consider methods 
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working across non-overlapping views [13, 79, 85] but rather focus on 
methods utilizing overlapping views. For an exhaustive survey of multi- 
camera people detection and tracking, we refer to [46, 48, 98]. 

Since people detection and tracking in single-camera views have been 
intensively studied [66, 73, 104], many methods accomplish multi-view 
detection and tracking by fusing local detections or local tracklets into 
a common world coordinate system [7, 52, 101]. Since the detection is 
performed independently for each view, those methods do not take full 
advantage of the multi-view information and thus are often not able to 
resolve occlusion and measurement noise. Besides, the vast majority of 
employed pedestrian detectors is optimized to detect people in frontal or 
profile view (cf. Sect. 2.1). 

Homography based approaches project local image features from each 
sensor into a common plane to perform global detection [88]. In [53] 
a homographic occupancy constraint is proposed to handle occlusion and 
detect people on a common scene plane. Eshel et al. [34] propose a similar 
approach, projecting the foreground pixels of all views into a common 
height plane for head detection. In [75] those approaches are extended by 
a multi-view Bayesian network in order to avoid false positive detections 
arising from occlusion artifacts. 

Another class of related approaches addresses the problem of multi- 
camera detection by employing a generative model to jointly take ad- 
vantage of the image evidence of all available views. Fleuret et al. [39] 
introduce the idea of estimating the probability of people present on the 
ground level plane by a binary generative scene model. The authors ex- 
tend their approach and publish it together with an integrated tracking 
algorithm in [38]. Since then, the method is referred to as Probabilistic Oc- 
cupancy Map (POM) in the computer vision literature. POM takes binary 
foreground images as input and compares them with synthesized binary 
masks. The binary masks are obtained by a generative model, which 
represents individuals by a simple rectangular box. The final occupancy 
map is estimated by mean-field variational inference. For the last decade, 
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POM served as a standard method for multi-view people detection and 
has been evaluated on many standard benchmarks like PETS [12]. Since 
the method proposed in this thesis is heavily inspired by POM, it will be 
discussed in detail in Sect. 3.2. Alahi et al. [6] also propose a generative 
method for people detection in a camera network. They recast the task 
as a linear inverse problem, regularized by a sparsity constraint on the 
occupancy grid. Other than in [38], a silhouette person model is proposed. 
Unlike our approach, both methods utilize only 2D models and fit them to 
a binary foreground mask. 

Due to the massive availability of labeled RGB training data, more 
recent methods employ CNN architectures for multi-view people detection. 
Baque et al. [10] introduce an end-to-end multi-view people detection 
architecture. They propose to use the generative model from POM [38] 
in combination with a discriminatively trained CNN. In order to resolve 
ambiguities arising from occlusion, the interaction between grid cells 
is taken into account by Conditional Random Fields (CRFs). The final 
CNN /CRP architecture can be trained end-to-end by exploiting back mean- 
field techniques [106]. Chavdarova et al. [23] present a CNN architecture 
enabling end-to-end multi-view probabilistic occupancy map estimation. 
To overcome the lack of an appropriate multi-view data set, an existing 
monocular pedestrian data set [32] is used. 

In contrast to [10, 23], Hou et al. [47] introduce an anchor-free end-to-end 
multi-view CNN architecture without using generative models or CRFs. 
The multi-view image evidence is fed into a CNN with shared weights 
among the views. Intrinsic and extrinsic camera calibrations are used to 
explicitly project each feature map to the ground plane. The projected 
feature maps are further aggregated by convolutions, finally predicting the 
occupancy map on the ground plane. The authors report state-of-the-art 
results on the challenging WILDTRACK data set [22]. 

Other research groups focus on integrated tracking and detection sys- 
tems, leveraging deep learning methods for detection and tracking. Tang 
et al. [93] propose joint multi-view people tracking and pose estimation. 
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They apply a classical tracking-by-detection scheme on each single view. 
To associate tracklets across views, appearance and semantic features are 
used. In a second step single-view human 3D pose estimation is applied 
based on OpenPose [18]. Finally, the resulting 3D poses are fed back to the 
multi-view tracker in order to improve the tracklet association across multi- 
ple views. Chen et al. [24] propose a real-time people tracking-by-detection 
system. For detection a computationally demanding global CNN detector 
is applied to key-frames of each view interdependently. To get fast detec- 
tions between key-frames a local CNN detector is combined with classical 
motion prediction. You et al. [102] present Deep Multi-Camera Tracking 
(DMCT), an end-to-end multi-view tracking and detection pipeline, which 
focuses on real-time applications. 

A related class of approaches addresses the problem of multi-view 
crowd counting. Ge et al. [42] propose a generative probabilistic approach, 
using binary foreground masks as input. While the method is related 
to POM [38], the authors propose a sampling based strategy based on 
the Reversible Jump Markov Chain Monte Carlo (RJMCMC) [43] sampler 
instead of mean-field variation inference. This yields the advantage that a 
more complex posterior distribution can be approximated, in particular 
the global optimization can be obtained in continuous location space. 
More recent work focuses on deep learning techniques for crowd counting. 
Zhang et al. [103] introduce an end-to-end multi-view CNN architecture 
to predict the density map on the ground plane of the scene. For multi- 
view fusion the authors propose different late and early fusion strategies 
and report comparative results on different publicly available data sets. 
Castellano et al. [21] propose an alternative CNN architecture, focusing on 
the special demands of crowd counting from unmanned aerial vehicles. 


2.2.2 Depth-based Approaches 


In contrast to the RGB methods mentioned above, only a few existing 
approaches rely on multiple depth images for people detection. In this sec- 
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tion we will first discuss methods focusing on pedestrian detection from 
the frontal or profile view. Those approaches follow the same paradigm 
and apply a classical single-view people detector to each RGB-D view in- 
dependently [20, 69, 89, 91]. Subsequently, we will discuss methods which 
explicitly focus on top-view people detection and tracking in multiple 
depth images [64, 96]. While the former group of approaches in general 
leverages RGB and depth data (RGB-D), the latter only makes use of the 
depth information. 

In early research Saputra et al. [89] propose indoor people tracking using 
two Kinect v1 sensors’. For people detection and tracking the authors 
apply human segmentation and the skeleton tracking, obtained by the 
Kinect for Windows SDK v1’, to each view independently. 

Following these ideas, Sun et al. [91] also employ a network of Kinect 
v1 sensors for people detection. Similar to [89], the human skeleton track- 
ing from the Kinect for Windows SDK v1 is applied to each single view 
independently. The Kinect for Windows SDK v1 can only keep track of the 
skeletons of two individuals at the same time. To overcome this limitation 
a skeleton interleaving strategy is proposed to handle up to six individuals. 
Kalman filter based tracking is applied to foot points of individuals present 
in a single view. To handle occlusions, a trajectory matching scheme is 
proposed, in order to fuse trajectories from the same person (obtained by 
different views) to a single, global trajectory. 

Munaro et al. [69] introduce the OpenPTtrack framework for people 
tracking in an RGB-D camera network. The project aims for an integrated, 
easy-to-use software system, including different camera calibration al- 
gorithms as well as sensor modalities. The two-stage people detection 
algorithm is performed on each view independently. First, a depth-based 
clustering is applied on the point cloud data. Second, for each cluster the 


! The Microsoft Kinect v1 sensor is an active consumer RGB-D camera. The depth informa- 


tion is obtained by an actively emitted pattern in near-infrared, while an infrared camera 
captures the reflection of the pattern (structured light). 

https: / / www.microsoft.com/en-us / download /details.aspx?id=40278, accessed 
31.03.2021 
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corresponding image region in the RGB image is fed into a HOG-based 
people detector. The resulting detections obtained by each sensor are gath- 
ered by a centralized tracking node and a typical tracking-by-detection 
scheme is applied. 

Carraror et al. [20] propose an approach for human body pose estimation 
in a network of RGB-D sensors. To obtain a 3D skeleton, CNN-based 
pose estimation [99] is applied to the RGB images of each single-view. 
Incorporating the available depth data results in multiple 3D skeletons 
for each view. The single view 3D skeletons are further processed by a 
global node to solve the data association problem between the individual 
3D skeletons, providing multi-view pose estimation. 

In contrast to the aforementioned approaches, Tseng et al. [96] present 
an indoor people detection system based on multiple sensors in top-view. 
Their approach is based on a fused virtual top-view depth image, ob- 
tained by the point cloud of each sensor. For detection a hemiellipsoidal 
head model is employed to take advantage of the discriminative height 
difference near the head contour of a human. In contrast to previously 
mentioned methods, the authors rely on depth data only. Furthermore, 
the detection is applied to the fused depth image, leveraging the full multi- 
view image evidence. In [64] the authors extend their approach, and focus 
on people tracking in a hybrid overlapping and non-overlapping camera 
network setup. 


2.3 Integration of Present Thesis 


In this section we will classify the contribution of the present work with 
respect to the aforementioned literature. As pointed out in Sect. 2.2.1 the 
vast majority of approaches rely on monocular video cameras but not on 
depth sensors for multi-view people detection. While in this thesis we ad- 
dress a different setup (top-view people detection with depth sensors), our 
work is methodically strongly related to multi-view approaches following 
the idea of generative modeling. Namely, these are POM, introduced by 
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Fleuret et al. [38], crowd detection with an MCMC sampler proposed by 
Ge et al. [42] and sparsity driven people localization introduced by Alahi 
et al. [6]. While, compared to the present work, the two latter approaches 
use very different methods for inference, our work is methodically most 
similar to [38]. Although we share a comparable mean-field inference 
strategy, there are several distinctions: 


(i) our approach uses depth images as evidence and is therefore able to 
make use of a generative scene model based on a 3D shape model; 


(ii) we introduce a novel strategy to approximate the final mean-field 
update expectation by making use of geometric scene knowledge, a 
pre-computed visual dictionary and a weighted asymmetric image 
similarity; 


(iii) we propose to incorporate the temporal context jointly in the mean- 
field optimization to improve the detection performance. 


According to the sensor modality our work belongs to the category 
of depth based multi-view approaches. In contrast to our work the vast 
majority of approaches in this category [20, 69, 89, 91] are based on RGB-D 
data and focus on the classical frontal or profile view. Besides, all aforemen- 
tioned approaches apply single-view detection algorithms independently 
on each view and merge the local results to obtain global multi-view de- 
tections. Due to the single-view approach, they do not take advantage of 
the full multi-view information and heavily rely on the specific limitations 
of the employed people detector (e. g. viewpoint dependence). 

To the best of our knowledge the only approach explicitly addressing 
the problem of indoor top-view people detection in multiple depth im- 
ages is the work by Tseng et al. [96]. While our work addresses the same 
problem, their approach is methodically quite different. The authors fuse 
multiple depth views in a virtual top-view depth image and apply a dis- 
criminatively trained detector. While the proposed early fusion strategy 
successfully leverages the multi-view observations, evidenced by strong 
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detection performance, their approach is focused on high quality depth 
data obtained by active depth sensors. In contrast, our method is particu- 
larly suitable for low resolution passive stereo sensors. Furthermore, our 
method approximates the joint probability distribution of people present 
in the scene, while [96] provide a MAP point estimate. 

To summarize, methodically our work is highly inspired by Fleuret et al. 
[38], while from an application perspective it is most related to the work 
of Tseng et al. [96]. In contrast to recent data-driven CNN architectures [20, 
23, 61, 67, 95] our method requires no training data and the detection con- 
fidence can be quantified more precisely by approximating the posterior 
distribution. 
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In this chapter we will set the methodical foundations for the present 
thesis. In Sect. 3.1 we will outline the general idea of variational inference. 
In particular, we focus on the mean-field approximation which serves 
as the basis of the inference strategy proposed in this thesis. Based on 
this, we will discuss the application of mean-field variational inference to 
the problem of multi-view people detection in binary foreground images, 
referred to as probabilistic occupancy map (Sect. 3.2). 


3.1 Mean-Field Variational Inference 


A major challenge of probabilistic inference is that in many real world 
applications the posterior distribution over some symbolic latent variables 
X given the observations O, 


p(4,0) 


pæ |0) = Be, 


(3.1) 


is hard to compute or even intractable. Although it is possible for typical 
probabilistic models to compute the joint distribution p(Y, O), the evidence 
or marginal data likelihood 


p(O) = fo oax (3.2) 


contains an integral (or a sum in case of discrete latent variables) over 
the full state space induced by Y, which is often intractable. The main 
idea of variational inference (VI) is to find a tractable proxy distribution 
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Figure 3.1 Schematic illustration of the variational inference objective. The green shape 
represents the space of all probability distributions, whereas Q represents a subset of possible 
proxy distributions (also referred to as q-family). The variational objective is to find a proxy 
distribution y € Q which is, with respect to some similarity measure, as close to the true 
posterior distribution p(Y | O) as possible. 


q (4) which is, with respect to some similarity measure, as close to the true 
posterior p( | O) as possible (see Fig. 3.1). The proxy distribution q(4) 
can in general be an arbitrary probability distribution and is not restricted 
to some particular parametric form. In contrast to parameter estimation 
methods such as maximum-likelihood estimation (ML) or maximum a posteriori 
estimation (MAP), Vlis a more general framework, since it allows to find 
the optimal distribution q by defining the objective as a functional rather 
than optimizing the parameters of a distribution explicitly. 

The variational methods used in this work are part of the mean-field 
theory [74], which has its origin in statistical physics. The mean-field 
theory is applied in many scientific areas, in particular in statistical field 
theory, for example to understand ferromagnetism (cf. Ising model [25, 
49]). In the last decades these methods haven been refined by the statistics 
and machine learning community [15, 28, 40, 50, 97] to make it applicable 
for general probabilistic inference. In this work we stick to the notation 
and conventions commonly used in the machine learning literature [11, 
70]. 
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To derive the objective for the optimal proxy distribution q one has to 
take further assumptions: First, one has to define a similarity measure be- 
tween the true posterior distribution p(X | O) and the proxy distribution 
q(X). The most popular choice is to use the Kullback-Leibler divergence 
(KL divergence, see Sect. 3.1.1). Following this assumption in Sect. 3.1.2, 
the general objective for KL variational inference is derived. Second, one 
has to take some assumptions on the structure of the proxy distribution 
q. In this work we use the so called mean-field assumption, which assumes 
that q factorizes over its marginal distributions. In virtue of this assump- 
tion it is possible to derive an optimization scheme, where the marginal 
probabilities q(x;) are updated iteratively based on the current state of all 
other latent variables {x} : i 4 j}, see Sect. 3.1.3. 


3.1.1 Kullback-Leibler Divergence 


The Kullback-Leibler divergence [60], also known as relative entropy [26, 
p-19], has its origin in information theory. In the context of probability 
theory it is often interpreted as a measurement of dissimilarity between 
two probability distributions. Let p, q be the probability density functions 
of a continuous random variable, then the KL divergence is given as 


EUN ES 
Kup a) = fotos (FE) ax, (53) 
which can also be written as 
Kup | =- f saos ($27) ax. 64) 


For discrete probability distributions one has to replace the integral by 
the corresponding sum. The KL divergence is only defined if q(X) = 0 
implies p(W) = 0. However, a common convention in the literature is that 
the KL divergence is set to infinity KL(p || q) = oo if there is any Y such 
that g(4) = 0A p(X) > 0 (cf. [26, p. 19]). It can be proven that the KL 
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divergence is non-negative KL(p || q) > 0 for any p,q and is zero if and 
only if p = q (cf. [26, p. 28] and [11, p. 170]). It is important to notice that 
the KL divergence is asymmetric and in consequence does not fulfill the 
requirements of a distance metric. 

An alternative perspective on the KL divergence follows trivially by 
applying the definition of expectation to (3.3), given as 


= p(x) 
ko = (ore) 65) 
= (log p(4) — log (A) oía) > (3.6) 


where (+) (x) denotes the expectation with respect to the distribution p(X). 


3.1.2 KL Variational Inference 


Since the KL divergence is asymmetric, the order of the arguments does 
have an impact on the variational objective. In the literature KL(p || q) is re- 
ferred to as forward KL divergence and KL(g || p) as reverse KL divergence 
respectively. Inspecting the definition of the forward KL divergence 


«|o 
KLAR |) pae) = [male (PLP?) ax, — en 
one can see that the forward KL divergence is infinity if q(4) = 0 and 
p(X |©) > 0. Thus, if using the forward KL divergence as an objective, 
q(X) gets forced to be non-zero if p(Y | O) > 0. On the other hand, the 
reverse KL divergence 


Ka) Lot 10 = f actos (os) a (38) 


has the opposite effect. It approaches infinity if p(X | O) = 0 and q(&) > 
0. Following the same argument as above, q(X) gets forced to be zero 
for p(X |©) = 0. One can conclude that when fitting a distribution q 
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— p(x) =A 
Es I2) 0.18 —= easi 


(a) KL(p || q) (b) KL(q || p) 


Figure 3.2 Comparison of the results when fitting a single univariate normal distribution 
qtoaunivariate bimodal distribution p with respect to the forward (a) and the reverse (b) 
KL divergence. The exemplary "true" distribution p(.X ) (blue curve) is given as a mixture 
of Gaussian p(Y) = 0.5(N (X | —u,0)) +N (X | 1, o)) with u = 8,0 = 2. (a) shows the 
optimal fitting normal distribution q with respect to the forward KL divergence, given by 
the moments of p(.X ). (b) shows q with respect to the reverse KL divergence respectively. 
In contrast to (a), the optimal choice of q is not given in closed form for the reverse KL 
divergence. In this example the optimum is to locally fit q to one of the modes of p (example 
based on [11, p. 619 ff.]). 


to a distribution p with respect to the forward KL divergence, q is in 
general overestimating the support of p. In contrast, using the reverse KL 
divergence is potentially underestimating the support of the true posterior 
p. 

Studying Fig. 3.2 and following the arguments in [70, p. 733 f.] one 
can see that in a scenario where p is a multimodal and q a unimodal 
distribution, using the forward KL divergence is problematic. Since q is 
forced to be a unimodal distribution, the mode of q is somewhere between 
the modes of p, which is a region where p has rather low probability 
density. 

Another aspect is the tractability of the resulting optimization problem. 
Using the forward KL divergence one has to compute the expectation with 
respect to the intractable distribution p(X) which is in general hard to 
compute. However, using the reverse KL divergence yields the advantage 
that expectation is computed with respect to the simpler proxy distribution 
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q(X). Depending on the choice of the q-family (thus the structure of q) 
this leads to a feasible optimization problem (see Sect. 3.1.3). Therefore, it 
is common to use the reverse KL divergence to formulate the variational 


objective 

ar) = A [px 10), (3.9) 
with the KL divergence 

KL(a(4) | p(X | O)) = (log q(¥) — logp( | O)) ax) (3.10) 


= (log q(#)) a, — log (A | O)) uy. (811) 


Expanding the expectation over the conditional distribution we can unveil 
the dependence of the evidence p(O) 


logp(X | Oax) = Uosr(#.O))q¢x — Moen) xy (612) 
= (logp(4, O)) x) — logp(O). (3.13) 


Inserting into (3.11) we can rewrite the KL divergence as 


KL(a(4) || p(X | O)) = og (4) acy — (log p(¥,O)) qu, 
eS 
KL(a(x) | p(x,0)) (3.14) 
+logp(O). 
Since log p(O) does not depend on X it can be treated as a constant 
and does not affect the solution of the objective. Thus minimizing the 


original KL divergence (3.11) is equal to minimizing the KL divergence 
KL(q(A) || p(X,©)). To get further insights we rearrange (3.14) to 


log p(O) = KL(a(4) || p(& | O)) — (log a( 4) qx) 


(3.15) 
+ (log p(¥,O)) ua): 
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log p(O) 


£(q) (ELBO) € 0 


Figure 3.3 Graphical illustration of (3.15). Maximizing the ELBO is equivalent to minimiz- 
ing the KL divergence KL(g(X) || p( | O)). The KL divergence can also be interpreted as 
the error between the lower bound £(q) and the actual evidence log p(O). Figure inspired 
by Bishop [14, Fig. 9.11]. 


Since KL(- || -) > 0, (3.15) implies a lower bound £(g) for the marginal 
data likelihood 


logp(O) > —(log q(X)) x) + (log p(¥,O)) a) - (3.16) 
A A E E 


L(q) 


In Fig. 3.3 a graphical illustration of (3.15) and (3.16) is given. In the 
machine learning literature the lower bound £(g) is referred to as evidence 
lower bound (ELBO). Notice that the definition of the ELBO is related to the 
concept free energy in statistical thermodynamics, therefore — £ (q) is also 
referred to as variational free energy (cf. [70, Sect. 21.2.1]). As a consequence 
of definition (3.4) the ELBO can also be written as 


L(a) = -KL(a(&) | p(4,0)). (3.17) 


Maximizing the ELBO £(g) with respect to q has two relevant aspects: 
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(i) In face of (3.15) and (3.17) one can see that the original objective (3.9) 
is equivalent to the objective 


q = arg max £ (q). (3.18) 
q 
Thus maximizing the ELBO (3.16) leads to the desired approxima- 
tion of the conditional distribution p(¥ | O). 


(ii) As a direct consequence of the definition (3.16), maximizing £(q) 
yields a lower bound on the marginal data likelihood p(O). This is 
particularly useful for Bayesian model selection (cf. [11, p. 619]). 


3.1.3 Mean-Field Approximation 


Having defined the variational objective in the previous section as maxi- 
mizing the ELBO £(g) with respect to some distribution q(X ), one still has 
to restrict the set Q of possible distributions q € Q to make the optimiza- 
tion feasible. In general, it is possible to do so by (i) assuming that q is of a 
specific functional or parametric form; (ii) assuming that q has a specific 
structure, i.e. defining how q factorizes over the latent variables X. While 
the first assumption (i) is more restrictive, (ii) is more general and exploits 
the full power of the variational approach. Notice that it is also possible to 
combine both assumptions. 

In this work we apply the so called mean-field assumption. While in 
general this assumption allows factorizing q(¥) over arbitrary disjoint 
partitions of latent variables [14, p. 464], in this work we apply the mean- 
field assumption in its simplest form'. We assume that q(X) is a fully 
factorized distribution, given as product over its marginal probabilities 


ax) =]] az), (3.19) 
i=l 


! Sometimes referred to as naive mean-field assumption in the literature [11, p. 623]. 
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where r is the number of latent variables. Even though this is a strong 
simplification on the structure of possible distributions q, there are no 
restrictions on the functional form of the individual distributions q;. In 
practice, it turns out that this simple factorial structure leads to a tractable 
variational objective for many probabilistic models. 

In order to deduce the mean-field update equations one needs to define 
a probability distribution over X except of one single element z; € X. 
Therefore, let 


arxz)= [[. ate) (3.20) 


j=L:jHt 


denote the mean-field distribution excluding the element ;. 


3.1.3.1 Mean-Field Update Equations 


Using the results from the previous section the general objective is to 
maximize the ELBO with respect to the function q, 


à = arg max L(q). (3.21) 
q 

Applying the mean-field assumption, one possible optimization technique 
is to apply a coordinate descent strategy, where one marginal distribution 
q; gets updated, while the other variables stay fixed. To derive the corre- 
sponding update equations one can isolate the dependency of one single 
q;. Following the detailed derivation in appendix A.1, it follows that the 
ELBO Z(q;) for a single q, can be written as a negative KL divergence 


L(q;) = -KL(q;(x;) | Bi(X,©)) + const, (3.22) 
with the distribution 

y 1 

P(X, 0) = z exp((logr(¥,O))q(x\21)) (8.23) 
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where Z; is the partition function and the notation (-) (y. ) refers to the 
expectation with respect to the distribution q(-) for all {x} : i 4 j}: 


(log p(4,0)), enzo = fogo, O) II q;(x;)dXNa;. (8.24) 
j-1ljzi 


Up to a constant term, £(q;) (3.22) is equivalent to a negative KL diver- 
gence between the desired distribution q;(2;) and a distribution propor- 
tional to exp (log D(X, e (Ma!) y According to (3.22) we can maximize 
the ELBO for one single distribution q;(z;) by minimizing KL(q; || p;). It 
trivially follows that the optimal d; (x) with respect to the objective 


G = arg max £(q;) , (3.25) 
di 
satisfies q; = p;. This finally leads to the so called general mean-field equa- 
tions [11, p. 625 ff.] 


lat) = g exp (log p(X,O)) gu) 


(3.26) 
x exp( (log (X, O)) e Na!) Tr 


for all marginal distributions q, with i € (1,...,r). Fixing all other latent 
variables X \ z;, this equation updates a single distribution q,(x,) with 
respect to the current mean-field state q(A \ x;). This is a general result 
of the mean-field assumption and does not hinge on a specific paramet- 
ric distribution family of q;. In practice the resulting parametric form of 
the marginal distributions q; and the tractability of the mean-field up- 
date equations depend in particular on the probabilistic model p(X , O). 
Thus one essential part of mean-field variational inference is to define 
a factorization for q and a probabilistic model p(-) leading to tractable 
mean-field update equations. Although now we have an individual equa- 
tion for each marginal distribution q;, in general we cannot write this as a 
closed form solution since each equation depends on the states of all other 
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marginal distributions q(Y V z;). This commonly leads to an iterative 
update scheme, where one distribution q, is updated with respect to the 
current state q(X N 25), which we will discuss in the following section. A 
detailed discussion on the general mean-field update equations can be 
found in [70, p. 736 ff.], [14, p. 465 ff.] and [15]. 


3.1.4 Coordinate Ascent Mean-Field Variational Inference 


In the previous sections we recast the general probabilistic inference prob- 
lem to an optimization problem, which is finally given by the general 
mean-field equations (3.26) for each marginal distribution q;. A popular 
algorithm to solve this optimization problem is referred to as Coordinate 
Ascent Variational Inference (CAVI) [15]. In order to decouple the depen- 
dence of the marginal distributions, CAVI iteratively updates one marginal 
distribution q, after the other, while all other distributions q, : j # i stay 
fixed. Hence, each q, is updated with respect to the current mean-field 
state q(¥ | x;) as presented in Algorithm 1. In practice it is not necessarily 


Algorithm 1 Coordinate Ascent Mean-Field Variational Inference (CAVI) 
1: procedure OPTIMIZE Q (Data: O, Model: p(X,©)) 


2: q(&)  init() > init mean-field 
3: while L (q) has not converged do 

4: for alli € {1,...,r} do > iterate over marginals 
5: qi < exp (logp(Y, O)) eva!) 

6: 

7: a) — qi > asynchronous update of MF state 
8: end for 

9: £(q) + (logp(4.O))qy — (log a (AY) qv) > compute ELBO 


10: end while 
11: return q 
12: end procedure 


needed to use the ELBO £(g) as a termination condition of the optimiza- 
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tion. It is often sufficient to monitor the convergence behavior of the 
marginal distributions q;. 

The CAVI algorithm relies on asynchronous mean-field updates (3.26), 
which means that the individual q;(z;)'s are updated sequentially in each 
iteration. It can be proven that updating each q; asynchronously gradually 
improves the approximation for each q; (cf. [11, p. 626]). In contrast, it 
is also possible to execute the mean-field updates synchronously. In a 
synchronous update iteration all the q, are updated simultaneously, using 
the same mean-field state obtained from the previous iteration. While 
synchronous mean-field updates can be easily parallelized, the conver- 
gence guarantees from the asynchronous updates do not hold anymore. In 
general synchronous updates can lead to oscillating effects during the op- 
timization. In Sect. 6.4.1.2 we report an example of this effect, in particular 
illustrated in Fig. 6.7. 


3.2 Probabilistic Occupancy Map (POM) 


In 2008 Fleuret et al. [38] introduce the probabilistic occupancy map (POM). 
For almost a decade, POM served as the state-of-the-art method for multi- 
view people detection in RGB images. Although recent state-of-art meth- 
ods such as Deep Occlusion [10] outperform classical POM, Deep Occlu- 
sion still partially relies on the generative model introduced by POM. Since 
the present work is inspired by POM, in this section we will discuss the 
technical details of POM and reveal the relations to this thesis. 

POM solves the problem of multi-view people detection as an inverse 
problem. Multiple overlapping foreground-segmented binary images are 
used as input and compared against synthesized binary images obtained 
by a generative model. Each individual person is expressed by an axially 
parallel rectangular box (see Fig. 3.4(a)) in order to generate synthetic 
foreground images. For inference the KL divergence between the true 
posterior distribution and a fully factorized distribution is minimized by an 
iterative optimization algorithm. Even though in the original work [39, 38] 
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the authors do not refer to the method of mean-field variational inference, 
the update equations derived in [38, Eq. 25] satisfy the general mean-field 
equations (3.26). To reveal the connection of POM to the present work 
we use the notation and terminology of mean-field variational inference 
introduced in Sect. 3.1 as well as the notation of the discrete probabilistic 
scene model introduced in Sect. 4.4. 


3.2.1 Probabilistic Model 


Let b = (b},...,bc)" be the input observations, given as foreground 
segmented binary images obtained from C cameras at one time step. Since 
POM does not depend on the temporal context we will omit the time 
index in this section to keep the notation clear. To define the occupancy 
grid, the plane at ground level is discretized into a regular 2D-grid of 
n locations. Each grid location u; will be assigned a realization x; of a 
Bernoulli random variable X; ~ B(u;), where u; denotes the probability 
of a person present at location u;. The occupancy map is represented as 
the vector x = (x,,.--,2,,)" € (0, 1}” and is also referred to as a (discrete) 
scene configuration in this thesis (cf. Sect. 4.4). The joint distribution over 


the observations and scene configuration is given as 
p(b, x) = p(b | x)p(x). (3.27) 


Fleuret et al. [38] employ two independence assumptions to make the 
distribution (3.27) tractable (cf. [38, Sect. 5.1]). 
First it is assumed that the prior p(x) factorizes as 


p(x) = |] (2). (3.28) 


This implies the assumption of individuals moving around independently 
of each other. While in reality this assumption does not hold e. g. since 
individuals are in general keeping a certain inter-person distance to each 
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other (cf. Sect. 4.3.2), it is a common assumption to keep the posterior 
distribution tractable. 

Second, for POM it is assumed that the statistical dependence between 
views is only given by the people present in the scene. Thus, if the scene 
configuration x is fixed, the views are assumed to be conditionally inde- 
pendent: 


C 
p(b |x) = [ [ne 15). (3.29) 
c=1 
Although this assumption does not cover global effects, such as illumina- 
tion changes etc., it can be justified by considering that b, are foreground 
segmented images, while people are assumed to be the only moving ob- 
jects in the scene. 


3.2.1.1 Generative Scene Model 


To be able to define the likelihood term p(b, | x) Fleuret et al. [38] propose a 
generative model A, (x), which maps a scene configuration x to a synthetic 
binary image in the perspective of camera c. It is assumed that the cameras 
used are extrinsically and intrinsically calibrated. To generate a synthetic 
image A.(x), fixed sized cuboids are placed in the 3D scene depending 
on the provided scene configuration x. Each cuboid is rendered into 
all views using the given extrinsic and intrinsic camera parameters. For 
each rendered cuboid the axis-aligned bounding box is computed and the 
corresponding pixels are classified as foreground pixels in the resulting 
synthetic images. In Fig. 3.4 the synthetic image generation process is 
illustrated in detail. 

Let AŻ be a binary synthetic image with only a rectangle placed at the 
grid location with index i, where all pixels inside the rectangle are equal 
to 1 (Fig. 3.4(a)). The synthetic image generation is formally given as 


A,(x) = Uii AL, (3.30) 
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(a) Al (b) 4. (x) 


(c) (d) 


Figure 3.4 Illustration of generative scene model producing binary images in the perspec- 
tive of camera c. Each gray dot represents the center of a grid cell on the ground plane 
assigned with a Bernoulli random variable X; (for visualization only). Each black rectangle 
in (a) and (b) represents an individual present at a particular grid cell. (a) shows a synthetic 
binary image with exactly one individual placed at grid location i. (b) shows a generated 
synthetic image corresponding to an exemplary scene configuration x with four non-zero 
entries. The dimensions of the rectangles are given as the axis-aligned minimum bounding 
box of a cuboid, roughly approximating the shape of a person, as shown in (c) and (d). 
Figures inspired by [38, Fig. 4,6] 


where the union U is defined as the binary image union (pixel-wise logical 
OR). An exemplary synthetic image A,(x) is given in Fig. 3.4(b). Based on 
this generative model, the likelihood for a single observation b, is given as 


p(b. | x) x exp(—Spom(be, Ac(X))) > (3.31) 
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where dpom(bc; A.(X))) is a pseudo distance function between an observed 
image b, and a synthetic image A.(x). Let a,b € [0, 1] *" be images with 
the dimension W x H pixels, the pseudo distance function is given as 


Ibo (1— a) - (1 — b) © all, 


lallı 


Spo (ba) = - ; (3.32) 
with ||-||, = ||vec(-)||, defined as the sum of all pixels (element-wise L1 
vector norm). The parameter € can be interpreted as a pseudo standard 
deviation, controlling the reliability of the measurements. Finally, the 
desired posterior distribution is given as 


p(x |b) = [Ie |x) Dit (3.33) 


x Teo Spom(be, Ac(x))) | [ p(x) - (3.34) 


i=1 


3.2.2 Inference 


Due to the high dimensional latent space (0, 1)” the posterior distribution 
(3.33) is intractable. Fleuret et al. [38] propose to apply mean-field vari- 
ational inference (cf. Sect. 3.1) to approximate the posterior distribution 
p(x | b) with a simpler proxy distribution q(x) = ]];_, q:(£;) (mean-field 
assumption). The objective is given as 


q(x) = engin KL(q(x) || p(x | b)). (3.35) 


According to the general mean-field equations (3.26) the optimal update 
for the marginal distributions q; satisfy 


1 
qz; = 1) = 5 exp( (log p(b, x | z: = Data) > (3.36) 
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with the partition function 
Li = 5 exp( (log p(b, x | £i = 2 : (3.37) 
se{0,1} 


Inserting the probabilistic model defined in (3.33) the update simplifies to 
-1 


qi(r; = 1) = ; (3.38) 


C 
1 + exp (^ + 5 ver) 


c=1 


with the prior term 7, = log E and the data expectation term for 


one camera given as 


u = (Spam Be, Ac(x|z;—1))) en) 


(3.39) 
= A Ac(x]z;-0))) isa f 


A detailed derivation of the POM update equations (3.38) is given in 
Appendix A.2. For more insights we additionally refer to Sect. 5.2.1, where 
the mean-field update equations are derived and interpreted for a similar 
posterior distribution. Note that although the derivation given in this 
section is different from the one given in the original POM paper [38], the 
resulting update equations for the distributions q; are identical (cf. [38, Eq. 
25]). 


3.2.2.1 Approximation 


Since the expectations in (3.39) are not tractable, the POM authors propose 
to approximate the expectations by only taking into account the expecta- 
tion of the image generation process rather than the full image distance. 
For s € (0, 1) the approximation is given as (cf. [38, Eq. 26]) 


(Seem Ges A-(x|@i=8))) su B Ópom (be, (Ac(x]zi—5)) &\2,)) . 
(3.40) 
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(a) A. (b) AE! 


(c) Az? 


Figure 3.5 Schematic illustration of synthetic average images. (a) shows an exemplary 
average image for a distribution q(x) with four non-zero marginal distributions q;(x;). (b) 
conditional synthetic average image A = (A,(x|xz;=1)) diad where the black rectangle 
corresponds to the grid location i. (c) shows the corresponding conditional average image 
given that x; = 0. The images (a) - (c) only differ at the pixel values in the rectangle related 
to the grid cell with index i. Figures based on [38, Fig. 6]. 


Due to the binary images the expectation over the synthetic images 


A, = (AR) a(x\x,) (3.41) 


is easy to compute and can be interpreted as an average image (see 
Fig. 3.5(a)). The unconditioned average image A, is defined for every 
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pixel (u, v) as (cf. [38, Eq. 30]) 


Afu v] = (AR) 0) y) J G7 9G; 7 2). (8:42) 


j:Al[u,v]-1 


where the index set {j | A7 [u, v] = 1} includes any grid index j for which 
the corresponding synthetic image A? contains a foreground pixel at the 
image coordinate (u,v). The inverse probability in (3.42) ensures that 
pixels not influenced by any q; are assigned to a probability of zero. The 
conditional synthetic average image for camera c with respect to the cur- 
rent mean-field state q(x) and x; forced to state s € {0,1} is given as 


AL? = (Ac(xlz; = Sar) : (3.43) 


Considering the state s € {0,1}, the average image AL? is obtained 
by forcing all pixels in A, [u, v] effected by the state of x; (all pixels of A}, 
which are equal to 1) to 0 or 1 respectively (see Fig. 3.5(b,c)). 

This leads to the following two implications: (i) the conditional synthetic 
average image A’ is a function of q(x V x); (ii) the conditioned average 
images A17 and A‘ only differ from the average image A, in the pixels 
effected by the rectangle A; related to x;. The final POM update equations 
are given as 


1 
AA ge exp (7 + E, (Bombo: AE) — Bombo, 479). 


(3.44) 


In order to estimate the distribution q, Fleuret et al. [38] propose an iterative 
update scheme. In contrast to CAVI (Sect. 3.1.4), which uses asynchronous 
mean-field updates, a synchronous update is used. This yields the advan- 
tage that the average image A, needs only to be calculated once per itera- 
tion. Moreover, by exploiting integral images, the distance ôpom (be, ASS 
can be calculated in constant time. In total this leads to a real-time capable 
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estimation of the marginal distributions q,(x;). For a detailed explanation 
of the fast iterative update scheme used by POM we refer to the original 
paper [38, Sect. 5.4]. 
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In this chapter we introduce a probabilistic model for people detection 
in multiple depth images. First we declare basic prerequisites, covering 
aspects such as sensor network calibration and multi-view image evidence 
(Sect. 4.1). Hereinafter, we propose an abstract general probabilistic model 
(Sect. 4.2), which includes the generative scene model and the related 
modeling of the data likelihood. Based on the general model, we derive 
two different manifestations, sharing the same generative scene model and 
functional structure (e. g. conditional independence of the views). Namely, 
we put the general model in concrete terms for the continuous latent space 
(Sect. 4.3) and for the discrete latent space (Sect. 4.4), respectively. All the 
common aspects of the discrete and continuous model are covered by the 
abstract general model. 


4.1 Prerequisites 


4.1.1 Sensor Network Setup 


While the methods proposed in this work can in general be applied to any 
kind of depth sensor network, we will take some initial assumptions on the 
setup of the depth sensor network in order to simplify further modeling. 

First, it is assumed that the observed scene has one common plane 
at ground level, which we refer to as ground plane in this thesis. This 
assumption can be justified considering typical man-made environments, 
e. g. train stations, shopping malls or other indoor environments. Of course 
this assumption does not hold in some particular relevant scenarios, e. g. 
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Figure 4.1 Stitched multi-sensor image based on marker-free calibration. The colored 
quadrilaterals in the left image represent the view frustums intersected with the plane one 
meter parallel to the ground level. The three images on the right show the individual sensor 
views with corresponding color encoding. 


a staircase connected to a hallway. It is viable to combine the proposed 
method with a more general ground plane model, e.g. a composite of 
multiple planes, in order to handle such scenarios. 

Second, we assume that the depth sensors are intrinsically and extrin- 
sically calibrated in advance. If the sensors share overlapping fields of 
view in a way that there is at least one path connecting all sensors, the 
extrinsic calibration can be obtained by visual cues. Due to the available 
depth data, marker-free extrinsic calibration can be achieved in three steps: 
(i) for each sensor S;,...,S¢ the ground plane is estimated by a plane 
fit; (ii) one arbitrary sensor coordinate system is defined as the common 
world coordinate system; (iii) for the other sensors S, the rigid body trans- 
formation to the common world coordinate system is obtained by the 
corresponding natural image features in the overlapping fields of view. 
Due to the available depth data, each natural image feature match implies 
a world point correspondence. For the rest of this work we define P, as 
the projection matrix for each sensor S., which maps a point from the 
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common world coordinate system to the corresponding image coordinates 
of each sensor. 

As a direct consequence of the two former assumptions, we describe 
the presence of people in the scene in ground plane world coordinates 
(x,y) € R”. A particular constellation of individuals on the ground plane 
is referred to as a scene configuration. In this thesis scene configurations 
are given in two manifestations: a list of continuous plane coordinates 
(cf. Sect. 4.3) or as a discrete ground plane grid (cf. Sect. 4.4). 


4.1.2 Observations 


As image evidence, we use low-resolution depth images, obtained by pas- 
sive stereo-vision-based depth sensors’. For inference, we use foreground 
segmented depth images, which simplifies the proposed generative model 
(Sect. 4.2.1). The foreground segmentation is obtained by static background 
subtraction, applied to the raw depth images. For real-world applications 
with a dynamic background, an online learned background model could 
be employed (for a comprehensive review of background subtraction 
methods we refer to [55]). Fig. 4.2 illustrates the intensity images, raw 
disparity images and foreground segmented depth images, observed from 
three synchronized sensors. 

Typically, the image acquisition time of commodity (depth) cameras can 
not be synchronized across a network of cameras trivially. To overcome 
this limitation, we use a pseudo synchronization, where each sensor is 
synchronized with a global timer over ethernet via the standard Network 
Time Protocol (NTP). By annotating each captured image with the global 
acquisition time, it is possible to align the observations from multiple 
sensors by temporal proximity. As a result we obtain multi-view frames 
with limited time difference within a single temporal frame. Considering a 
typical raw frame rate of 25 Hz, the theoretical maximum synchronization 


1 Although our method is particularly able to handle noisy low-resolution depth images, it 


is not limited to it. Certainly it is also viable to use high-quality depth observations e. g. 
from an active range sensor. 
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(a) Sensor view 1 (b) Sensor view 2 (c) Sensor view 3 


(d) Disparity image 1 (e) Disparity image 2 (f) Disparity image 3 


(g) Observation 1 (0) (h) Observation 2 (05) (i) Observation 3 (03) 


Figure 4.2 Example observations from the multi-view setup with five people present in the 
covered area. Each column corresponds to one sensor. The first row (a-c) shows the gray 
scale intensity images. In the second row (d-f) the raw disparity images are depicted. The 
third row (g-i) illustrates the foreground segmented depth images, which are finally used for 
inference. 


error is 5558 = 20 ms. On average the synchronization error in our setup 
is roughly 10 ms, which is tolerable for the application of people detection. 
For the rest of this thesis we neglect the synchronization error and assume 
that the multi-view frames are acquired synchronously. 

Even though the models and methods introduced in this thesis are 
designed to operate on foreground-segmented depth images, they are 
not limited to these specific observations. By defining a forward model 
which enables the generation of synthetic observations (cf. Sect. 4.2.1), a 


different sensor modality can be easily incorporated into the proposed 


4.2 General Model 


framework. It is even possible to combine different sensor modalities (i.e. 
types of observations), as long as the forward model for each sensor is 
well-defined. 


4.2 General Model 


In this section we define the joint distribution of people present in the 
scene over space and time. To keep the notation uncluttered, we reuse the 
symbolic latent variable X introduced in Sect. 3.1. While in the previous 
section X is defined as an arbitrary symbolic latent variable, here we 
refine the semantics of A. In this section Y, represents a symbolic scene 
configuration at time step t. While the abstract scene configuration Y 
is used to elaborate the general structure of the probabilistic model, it 
can be instantiated by any kind of scene state representation, including 
continuous (Sect. 4.3) as well as discrete (Sect. 4.4) representations. The 
foreground-segmented depth observations at the time step t, acquired 
from depth sensors $, ... Sc, are given as o, = (04 4,---,0c4)'- 
The joint probability distribution for time steps 1,..., T is given as 


PAK 1,017) = porr | X wr)p(X 17). (4.1) 


To keep the joint distribution tractable, we make two general assumptions. 
First, we assume that the probability of the current state Y, depends only 
upon on the previous state V,_, (first order Markov assumption). For 
t > 1 we can write 


p(X4] X141) = P(e | 4,4). (4.2) 
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Using the Markov assumption, we then can express the distribution of a 
path of scene configurations as 


PAX yr) = PK | Y yr 1 pol sai) (4.3) 
= p(X r | Xq-i)p(X 7-1) (4.4) 

T 
= p(X) | [p(x | ¥4_-1)- (4.5) 

t=2 


Second, we assume that for given scene configurations X ,.7 the observa- 
tions are conditionally independent 


E 
plorr | X17) = IELA ; (4.6) 


t=1 


Inserting the immediate results from the first order Markov assumption 
(4.5) and the conditional independence assumption (4.6) into the joint 
probability model (4.1), we can write the joint distribution as a first order 
Hidden Markov Model (HMM) 


T 
P(X 1.7, 01-7) =pl01 | 4 1)p( 1) IEG |X PEt | Xi). (47) 


t=2 


Based on the joint distribution (4.7), we will describe the general structure 
of relevant distributions in the remainder of this section. For ease of 
notation, we omit the integration domain, where f p(¥ 1.7) d¥ ,. isa short 
hand for fp: fp p(X 1,7) dX, ... dX p with D defined as the full domain 
of X. Note that in this section the integrals are defined over an abstract 
state space implied by X'. Depending on the concrete manifestation of the 
scene configuration space, the integrals have to be refined or, in case of a 
discrete scene configuration space, replaced by the sum over all discrete 
states respectively. The distributions derived in the following will find 
its expression in models for realizations of specific scene configuration 
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spaces in Sect. 4.3 and Sect. 4.4. For a detailed derivation of the general 
distributions, we refer to Appendix B. 

In the context of Bayesian inference, typically the most relevant ques- 
tion is how the latent variables are distributed, given the observations. 
Formally, conditioning the joint distribution (4.1) on the time series of 
observations 0.7 leads to the full posterior distribution 


pour, Vir) (4.8) 


p(X imr | 01:7) = p(o1.r) d 


In consideration of the first order Markov assumption and the conditional 
independence of observations (summarized in (4.7)), the posterior can be 
rearranged for any t € (2,..., TP to 


plo, | APA | A a)p( 1:41 | 014-0) 
plo; | 0141) 


; (4.9) 


PX ia | Ort) = 


as derived in (B.2). In the literature, the denominator p(o, | 0;.;_1) of (4.9) 
is also referred to as the predicted likelihood and is, according to (B.3a), 


given as 


plo 01123) = | nos Lp ona) (4.10) 


Since the full posterior distribution given in (4.9) is hard to compute in real 
world applications due to the integral in the predicted likelihood (4.10), 
it is often more viable to model the probability of the current state Y, 
given the past observations o,.,. Marginalizing the posterior distribution 
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p(X 44 | 014) with respect to the previous states X ,.,_, leads, for any t € 
(2, ..., T), to the recursively defined distribution 


plo, | €) 
p(X s (O14) = Pet ew 
ASOT) o lama 


I AA 
Sl 
»(X«|o4-1) 


(4.11) 


as deduced in (B.4a). In the Bayesian filtering framework this is referred to 
as the filtering or update distribution, where the integral over the previous 
state X ,. , in (4.11) is called the predictive distribution 


soris: J ARA onada (4.12) 
T 


previous filtering dist. 


To summarize, rearranging the joint model (4.7) allows to define the 
recursive filtering distribution p(Y, | 01.4), whereas the previous states 
are condensed in the predictive distribution p(X, | 01, 4) . Since the 
predictive distribution is defined recursively and only includes an integral 
over the previous state Y, ,,it can be approximated effectively in practical 
applications. For a more general discussion of typical inference problems 
in Hidden Markov Models we refer to Barber [11, p. 495 ff.]. 


4.2.1 Generative Scene Model 


To define the probabilistic model we make use of a generative scene model 
G.(X,P.), which maps a scene configuration X and a given projection 
matrix P, to a synthetic observation (i.e. synthetic depth image) from 
the perspective of sensor $.. We use a simple, rotationally symmetric 3D 
person model depicted in Fig. 4.3(a), consisting of a cylinder for the body 
and a sphere for the head. Exemplary samples of the proposed generative 
scene model are given in Fig. 4.3(b-d). 
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(a) 3D person mesh 


(b) G(X, P1) 


(c) Go(¥, P5) (d) G4(X, P3) 


Figure 4.3 (a) 3D polygon mesh used in the generative scene model to represent an individ- 
ual, where the head is modeled by a sphere and the body by a cylinder. (b-d) Synthetic depth 
images for a given scene configuration X from the perspective of each sensor. 


Since our generative forward model is not only a function of Y, but also 
of the projection matrix P., we incorporate the physical sensor model in 
a principled way into our framework, allowing us to detect people from 


arbitrary viewpoints and to easily integrate a new sensor modality. 


4.2.2 Data Likelihood 


Analog to POM [38] (Sect. 3.2), we assume that the views are conditionally 
independent with respect to a fixed scene configuration A to make the 
likelihood tractable. Since we assume that only people are part of the 
foreground (cf. Fig. 4.2), and that the depth images are robust regarding 
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illumination changes, this assumption can be justified. Similar to (3.29), 
the likelihood factorizes as: 


C 
p(o14)=][ plo. | 4). (4.13) 


c=1 


To model the likelihood p(o, | X) for one depth observation, we assume 
that an observation can be described by the generative scene model pro- 
posed in Sect. 4.2.1 as a functional relation o, = G.(X,P.). Of course, 
this is a simplification which does not hold in reality for the proposed 
generative scene model, e. g. due to different shapes and poses of individ- 
uals as well as measurement noise. However, it allows us to employ an 
effective and tractable likelihood model. To weaken this assumption we 
additionally assume that the observations suffer from measurement noise, 
thus an observation o, is modeled as a realization of a random variable 


O. = GQ, P.) +7, (4.14) 


where the random variable 7 represents the measurement noise and Q a 
random variable related to a scene configuration A. In the literature this 
model is known as additive noise model [19, p. 42]. In general, detailed 
modeling of the depth measurement noise of a stereo vision sensor is 
challenging because several sources of uncertainty, such as image plane 
quantization, geometric camera calibration, and stereo correspondences 
matching, affect the final depth measurement (cf. [16, 51, 72]). For the sake 
of computational feasibility, we assume additive Gaussian noise? , thus 


n~ N (o. ops T) : (4.15) 


This implies the assumption that the total measurement noise y is given as a sum of many 
statistically independent noise sources y = nı + --- +n. If in addition the Lindeberg 
condition [57, p. 355 ff.] is satisfied, the Lindeberg-Feller central limit theorem indicates 
that the distribution of y approaches a normal distribution. 
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Assuming a fixed scene configuration state Q = Y, the measurement 
noise 7 and the scene configuration €? are statistically independent 7 1l. Q 
and the only randomness in O, is the measurement noise y, yielding an 
observation likelihood 


plo. | Y, Oops) = N (o. GA P,), aeos T) 
1 E (4.16) 
x ex (-—— lo. ER G(X, PoI) : 
2 obs 
In [19, p. 41 ff.] detailed remarks on the likelihood construction incorpo- 
rating additive noise can be found. 


4.3 Continuous Latent Space 


In this section we introduce the first manifestation of the general model 
presented in the previous Sect. 4.2. The simplified probabilistic model 
proposed in this section serves two main purposes. On the one hand, 
it provides an introductory model, for which it is fairly easy to derive 
the MAP objective and get a point estimate by solving the resulting non- 
linear least-squares optimization problem (Sect. 5.1.1). On the other hand, 
the resulting gradient-based inference can be used as a fine-tuning post- 
processing step in combination with the discrete model and inference 
methods proposed in Sect. 4.4 and Sec. 5.2, respectively. 

For the sake of simplicity, we assume that the number of people m 
in the scene is known a priori. Once m is fixed, a scene configuration 
X in the continuous latent space can be formally defined as an m-tuple 
X = (X4,..., X4) of ground plane world coordinates x; € R?, where each 
X; corresponds to the location of an individual. Based on the general 
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joint model (4.7) and the independence assumption (4.13), we define the 
posterior distribution 


IL UOS p(ocı | x) p(X, | X, .1)p(X4) 
p(oi.) 


p(Xyr | 01.7) == ’ (4.17) 


with the definition p(X, | X9) = 1. In comparison to the general joint 
distribution (4.7), we append a prior distribution p(X,) to (4.17), enabling 
the incorporation of a-priori scene knowledge. The data likelihood at one 
time step 1 2 ploc | X) is identical to the general model as introduced 
in Sect. 4.2.2. For inference in continuous latent space we are only inter- 
ested in the MAP state, therefore we refrain from further consideration 
of the evidence p(o,.7) in this section. The two remaining terms, namely 
the dynamics model p(X, | €, 4) and the prior p(€,), are defined in the 
following. 


4.3.1 Dynamics Model 


We employ a dynamics model to express the relation between two consecu- 
tive scene configurations. It is assumed that the movements of individuals 
are independent of each other, thus the dynamics term factorizes over the 
individual person locations X; 4, 


Tri 


p(X; | &ı-ı) = [[ iu | Xi. 41) . (4.18) 


i=1 


Since we assume only small movements of persons between two consecu- 
tive frames, we do not consider the velocity of individuals, but employ a 
first-order motion model. Hence, it is assumed that a person location at 
time step X; , is a noisy version of its predecessor X; ,. ;. To further sim- 
plify the inference on this model, we assume that X; , is normal distributed 


52 


4.3 Continuous Latent Space 


around X; ¿_1. Thus the dynamics distribution for an individual person 


location X; , is given as 
p(€,, | Xii x) = N (€, | Xii X) . (4.19) 


In Sect. 5.1.1 it turns out that this simplified Gaussian dynamics model 
leads to an effective and tractable objective for MAP inference. For a broad 
discussion of motion models in the context of people tracking we refer to 
Fleet [37]. 


4.3.2 A Priori Assumptions 


To incorporate given knowledge of the underlying scenario, we add prior 
terms to further restrict the set of likely scene states. For a scene configura- 
tion at one time step X, we employ two independent prior assumptions, 
Pbox(Xı) and paist(X+), with p(X;) = Ppox(X+)Paist(X+). Since the prior 
terms are independent of the current time step, we omit the time index t 
in this section. 

The first prior term ppox (X) reflects our knowledge of the visible ground 
plane (which can be inferred by the sensor network calibration, cf. 4.1.1). 
We model this knowledge by assuming that a person location X; is uni- 
formly distributed in the observable rectangular area, 


Po (X) = ] [»&) (4.20) 
i—l 
with 
PR) = U min: Xmax) » (4.21) 


where U (X min: Xmax) is the uniform distribution over the approximated 
rectangular area. In consequence the probability p(X;) is zero for X; being 
outside of the observable area. 
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The second prior can be motivated by the interpersonal distance. De- 
pending on the social context, two individuals in general keep a certain 
distance to each other in order to preserve their private space. We therefore 
consider the distance between all possible pairs of individuals in the scene. 
Formally, the distance prior factorizes over all possible pairs of person 


locations 


Paist(X1 BR Rn) = een Xj) 3 (4.22) 


Modeling the joint probability between two locations as a zero mean 
normal distribution with respect to the inverse distance 


1 


~ ~ ’ 
I&-&l,+e 


(4.23) 
we can write the pairwise joint probability as the unnormalized pseudo 
distribution 


1 
PX) | Odist) X EXP (- = TEESI ] (4.24) 


dist 


with cai, being the std. deviation of the inverse distance. To calculate the 
partition function this distribution needs to be defined on a finite interval. 
However, this is not relevant in this case, since for this model we are 
only interested in an MAP point estimate, which does not depend on the 
constant normalization term anyway. In Fig. 4.4 an exemplary function 
plot of the unnormalized distribution (4.24) is depicted for a fixed person 
location X ;. Itis clearly recognizable that the proposed pseudo distribution 
has the desired effect: the distance between two individuals x;,X; is 
positive correlated to the pairwise probability density f(&;, €; | vai). 
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y (m) 


Figure 4.4 Exemplary surface plot (a) and contour plot (b) for the pairwise distance prior. 
The unnormalized density (x; = (x, y), €; = (0,0) | cai; = 2) is plotted as a function of 
ground plane coordinates (x, y) with respect to a fixed individual at X; = (0, 0). 


4.4 Discrete Latent Space 


The continuous latent space, as introduced in the previous section, has 
strong limitations. First, the number of people in the scene has to be known 
a priori. Second, in Sect. 5.1 it turns out that the final MAP inference highly 
depends on a good initialization due to the underlying gradient based 
continuous optimization. To overcome these mentioned shortcomings, 
we propose a discrete scene configuration space. Similar to Fleuret et al. 
[38], the basic idea is to model the ground level plane as a discrete grid, 
where each grid cell can either be occupied by a person or be empty. In 
the robotic community, the general concept of a discrete grid of binary 
occupancy states is also referred to as an occupancy map (cf. Thrun [94]). In 
contrast to the continuous latent space model, and as a direct consequence 
of the occupancy map, there is no need to explicitly model the number 
of individuals in the scene to define a latent space of fixed dimensions. 
Moreover, the discrete scene configuration space allows us to define a prob- 
abilistic model in which approximate inference can be obtained effectively 
by mean-field variational inference (cf. Sect. 5.2). For the formal definition 
of the discrete probabilistic model, the ground plane area is discretized 
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AS 


Figure 4.5 Discrete ground plane grid, projected into all three sensor views. Each grid 
location (turquoise point) is associated with one Bernoulli random variable X; +, reflecting if 
a person is occupying the grid cell at time step t. 


into a 2D-grid of n locations (see Fig. 4.5). Each grid location u, will be 
assigned a realization x,, of a Bernoulli random variable X; , ~ B(p, ,), 
where 1; denotes the probability of a person present at location u, at 
time step t with 1 < t < T. The scene configuration for one time step t 
is then given as the vector x, = (1 ;,...,Un.1)' € (0,1)" (cf. POM [38], 
Sect. 3.2.1). 

Following the assumptions of the general probabilistic model defined in 
Sect. 4.2, the joint distribution for a sequence of discrete scene configura- 
tions x,.7 is given as 


T C 
P(X1.7, 017) = II I]»@.. | x) P(X | Xii); (4.25) 


with p(x; | xo) = 1. As a direct consequence of the joint distribution 
(4.25) and the general distributions in (4.9)-(4.11), we formulate three 
distributions for the discrete model: (i) the data posterior distribution, 
which omits the temporal context; (ii) the full posterior distribution; and 
(iii) the recursively defined Bayesian filtering distribution. For all three 
distributions we propose approximate inference methods in Sect. 5.2. 


(i) Data posterior p(x, | o,) Omitting the temporal relationship between 
consecutive scene states, the data posterior distribution allows us 
to express the likeliness of a scene configuration x,, given the ob- 
servations from all sensors at one time step o,. Assuming that the 
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prior for a scene configuration factorizes as p(x) = [[;_, p(z;), the 
posterior distribution is defined as 


IES ploc | x,) THE p(z;:) 
Dix’ €{0,1}” I P(e. | x’) IL p(x;) 


The data likelihood follows directly from the general model defined 
in Sect. 4.2.2. 


p(x; | 0,) = (4.26) 


(ii) Full posterior p(x,.7 | 0,.7) The full posterior takes all observations 


from time steps 1,..., T into account and models the likeliness of a 
path of scene configurations x,,...,x. The full posterior is given 
as 


T C 
D(X iT | 01.7) = Ik= I plosa not $ | 11) (4.27) 


def. 


with p(x, | x9) = 1. In addition to the data posterior, the full poste- 
rior includes a state transition distribution p(x, |x;_ı) which rep- 
resents the evolution of scene configuration states over time. The 
underlying probabilistic dynamics model is defined in Sect. 4.4.1. 
Note that the full posterior is hard to compute in real-world sce- 
narios since the complexity grows with the number of time steps. 
However, during inference it can serve as a baseline for comparison 
with the recursively defined filtering distribution and can find its 
application in scenarios where offline batch processing” is viable. 


(iii) Bayesian filtering distribution p(x, | 0;.,) In contrast to the full pos- 
terior, the filtering distribution models the likeliness of the current 
scene configuration state x;, given all observations o;,., up to time 
t. Applying the Bayesian filtering framework allows the recursive 
definition of the filtering distribution, where the past is condensed 


3 In this context batch processing refers to a detection method taking a sequence of frames 


as input rather than operating in a typical frame-by-frame manner. 
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in the filtering distribution p(x;_, | 0.1) from the previous time 
step (as described in Sect. 4.2). Applying (4.11) from the general 
model, the filtering distribution for the discrete scene configuration 
space is given as 


plo; | x¿)p(X, | 01:11) (4.28) 


p(x |o : ) m > 
a plo, | 0141) 


with the predictive distribution 


p(x; | 014-1) = 5 P(X: | xii) : P(Xe-1 | O1:4-1) - 
n —— 
X, €{0,1} previous filtering dist. 


(4.29) 


4.4.1 Grid Dynamics Model 


The complete construction of the distributions over space and time (4.27), 
(4.28), requires the definition of a state transition model p(x, | x, ,) that 
describes the evolution of scene configurations over time. In this work we 
do not focus on tracking but on leveraging the temporal context to regu- 
larize the mean-field optimization (see Sect. 5.2). Therefore, we propose a 
grid based dynamics model, without modeling the explicit motion of ob- 
jects. This leads to a computationally convenient model which represents 
the flow of probability mass across space and time. 

Computationally feasible inference can be achieved by assuming condi- 
tional independence of individual grid cell states z, ;, given the previous 
state x, ,. Hence, the state transition distribution factorizes as 


TL 


P(X | xii) = lI» | E) . (4.30) 


j=1 


This assumption is limiting the expressiveness of our model significantly 
because it prevents from modeling the relationship between grid cells at 
the current time step t. This can be illustrated by the following example. 
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Xt-1 


p(zj. | X11) 


Figure 4.6 Schematic illustration of the transition distribution p(x jt | X11) . For an exem- 
plary grid cell u; the probability p(x jt | XQ 1) (right image, red marked cell) is defined as a 
weighted sum over the transitions from all grid cells occupied in the previous time step (left 
image, yellow grid cells). 


Consider a person present at grid location u, at time step t — 1, moving 
to grid location u; or u; in the next time step t. Regarding the factor- 
ization (4.30) it is only possible to assign two independent probabilities, 
D(vj, | x,_1) for the person present at grid location u, and p(x; | x. 1) 
for the person present at grid location u,, respectively. However, it is not 
possible to express that the person moves to either grid location u; or u; 
but cannot occupy both grid cells at time step t. To be able to model the 
joint probability p(z; ,, £j , | x¿_1) at least a pairwise factorization of (4.30) 
is required, thus p(x | x4—1) = [jni P(it, ja | xii), with j ~ i denoting 
all pairwise direct neighbor indices. However, such an assumption would 
strongly increase the complexity of our model, leading to a computation- 
ally intensive inference problem, despite the mean-field approximation 
introduced in Sect. 5.2. Therefore, we leave this as an open question for 
future work and stick to the simplification (4.30) for this work. 
Considering the transition distribution (4.30) we need to define the 
probability p(z;; |x:_ı) of a person present at a grid cell u, at time step 
t, given the previous scene configuration state x, ;. As illustrated in 
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Fig. 4.6, the basic idea is to express the distribution p(zj | X, 1) asa 
weighted sum of the transitions from all previous x; ,..; being in state one 
(meaning that a person is present). The desired probability distribution is 
formally derived by first adding a hidden latent auxiliary variable to the 
joint model and then marginalizing over this variable*. More precisely, we 
first introduce an auxiliary latent random variable Z to the distribution 
plz. | X1), indicating the presence of a single individual in the previous 
state. Second, we marginalize over Z to express p(x;, | x, 1) as a sum 
over all transitions from grid cells occupied in the previous state x, ,. 

Let Z be a one-hot-encoded random variable with the realizations being 
Z = (z,,...,24)! with z, € {0,1}, such that 377 , 24 = 1. Further, let 
p(z, = 1) refer to the probability of a realization z with exactly one non- 
zero element z;,. As a consequence of the one-hot encoding the distribution 
of z can be written as 


TL 


plz) = | [»:**. (4.31) 


k=1 


with the definition 0° = 1. For the ease of notation we introduce weighting 
coefficients wy € [0,1] with $7; , Wg = 1 such that p(z, = 1) = wp, thus 
(4.31) can be rewritten as 


plz) = || wi: (4.32) 
k=1 


Introducing z to the distribution p(x; + | x;_) leads to the joint distribu- 
tion 


plzj1,2 | X4 1) = plz. | Z,X1-1) D(z |x4-1). (4.33) 


* This principle is well know in probabilistic modeling, e. g. a similar technique is used for 


mixture of experts models (cf. Barber [11, p. 448 ff.], Murphy [70, p. 344 ff.]). 
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Semantically z; is equal to one if a person is present at cell u, in the 
previous scene configuration x,_,, thus zz ¿1 = 1. In consideration of the 
one-hot encoding of z and (4.31) it follows 


n 


p(z;, t | Z, Xp 1) = l Dit | Zk = 1,1)” (4.34) 
= p(zja | % =1), (4.35) 


where we omit the conditioning on x,. ,. The resulting discrete distribu- 
tion p(x "m | — 1) reflects the probability of x; , given that one particular 
cell with index k is one in the previous state x,_,. Marginalization of (4.33) 
with respect to z results in the mixture model 


p(x; 4 | X, 1) = 5 p(zj4, 2 | X1-1) (4.36) 
z€{0,1}":|z|=1 
= vnu | ak — 1) p(% =1 | X, 4) (4.37) 
k=1 
= X plija | zk = 1) “Ups (4.38) 
k=1 


According to the definition of the weighting coefficients w; (cf. (4.32)), the 
distribution p(z, = 1 | x,_,) can be interpreted as normalization weights 


r, ife,,i=1 
A ae (439) 
0, else . 


Finally, the probability flow depends on the definition of the transition 
distribution p(z; , | 2, = 1) in (4.38), which denotes the probability that a 
person is present at location u;, given that a person was present at location 
u, at the previous time step. In our setup we expect only limited move- 
ment of individuals between two consecutive frames. Hence, we assume 
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0.0 0.2 0.4 0.6 0.8 1.0 
Probability 


(a) xi 1 (b) p(x|x;_1) 


(c) xi 1 (d) p(x|x4 1) 


Figure 4.7 Example of proposed discrete dynamics model for two scene configurations (a,c) 
with the corresponding output distributions (b,d) for tiger = 0.2 and wj, = 1. 


individuals will either move to an adjacent grid cell with a probability of 
Fine, Or stay at the current cell with a probability of u,.ır (see Fig. 4.7). Let 
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B(- | u) be the probability mass function of a Bernoulli distribution with 
the parameter u, then the transition distribution is given as 


B(x;, | H'seit) ; if j =k 
D(x; 4 | Zk = 1) == B(v;, | Hne) > if j € Ny (4.40) 
else, 


l—-cji, 


with N, being the index set of the direct neighbors of up. We define 


bs Hself 
== (4.41) 
Hne EA 
which leads to the special case where the emitted probability for one person 
present at the previous time step equals to one. As a direct consequence 
of the chosen transition model (4.40), which only allows movement in the 
direct neighborhood, we can set 


1, if -u Sth 
MEER DNE (4.42) 
0, else, 


while (4.38) still meets the requirements of a probability mass function. 
This has the additional effect that the expected number of people in the 
scene with respect to the dynamics model stays constant, thus 


(ella) (ja, ,) = ralla - (4.43) 


Note that because of the normalization weights w, it is generally also 
possible to use more sophisticated transition probability distributions. 
In Fig. 4.7 two concrete samples of the proposed dynamics distribution 
p(x; | xi.) are given. Fig. 4.7(a), (b) show the trivial case with only one 
person present at cell u; in the previous time step t — 1. According to (4.40) 
the probability mass is distributed uniformly with Hne = 0.1 to the direct 
eight neighbors and is 0.2 at cell u, at time step t. In Fig. 4.7(c),(d) a more 
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complex example with ||x, ||, = 9 persons present at the previous time 
step t — 1 is presented. 


64 


5 Inference 


In the previous chapter we proposed different probabilistic models for 
people detection in multiple depth images, introducing assumptions about 
the scene and the sensor observations. In this chapter we will focus on the 
probabilistic inference! regarding those models. In Sect. 5.1 we derive the 
MAP objective for the continuous model introduced in Sect. 4.3 and show 
how the final non-linear least squares problem can be solved by gradient 
based optimization methods. Instead of just obtaining an MAP point 
estimate, in Sect. 5.2 we propose to use mean-field variational inference to 
approximate the varieties of discrete probability distributions introduced 
in Sect. 4.4. In contrast to the aforementioned generative probabilistic 
inference methods, in Sect. 5.3 we propose a discriminatively trained multi- 
view CNN architecture, allowing a direct comparison between generative 


and discriminative inference in Chapter 6. 


5.1 Maximum a Posterior Inference in 
Continuous Latent Space 


In this section we present an inference method for the continuous scene 
model introduced in Sect. 4.3. For discrete scene configuration space, the 
proposed inference method allows the approximation of the full poste- 
rior distribution. In contrast, for the continuous case we only obtain a 


maximum a posterior (MAP) point estimate. As defined in Sect. 4.3, it is 


' Probabilistic inference refers to the task of estimating the probability distribution (or in 


case of MAP inference just the mode of the distribution) of one or more latent variables, 
given some evidence (observed variables). 
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(c) 


Figure 5.1 Optimization result for MAP objective without temporal dynamics. (a) shows 
the depth observations with the synthetic depth images corresponding to the initial scene 
configuration as green overlay. (b) shows the synthetic depth images corresponding to the 
final MAP result as blue overlay. The initial positions are given as green dots, the intermediate 
states of the optimization are drawn in purple. (c) illustrates the final optimization result 
back projected into the camera image of each sensor. 


assumed that the number of people in the scene is known a priori for the 
continuous model. This assumption is crucial for the proposed inference 
method, since it allows straight forward gradient based optimization on 
the location estimates of individuals in the scene. As a consequence of this, 
the proposed method is not suitable for inference in real-world applica- 
tions without further pre-processing (e. g. estimating the number of people 
in the scene by a coarse person detection method). Still, it can be seen as a 
complementary method to the proposed discrete inference method, since 
it effectively enables fine-tuned discrete person localization (cf. Sect. 6.6, 
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in particular Fig. 6.20). In the following we first derive the MAP objective 
(Sect. 5.1.1) and subsequently focus on practically solving the resulting 
non-linear least squares optimization problem (Sect. 5.1.2). In Fig. 5.1 an 
exemplary optimization process for one multi-view frame is depicted. 


5.1.1 MAP Objective 


With respect to the continuous probabilistic model defined in (4.17), the 
maximum a posteriori (MAP) objective is to estimate a sequence of scene 
configurations X, ‚r, which most likely explains the sequence of observa- 
tions 04.7. According to (4.17) the posterior distribution can be restated 
as 


O 
p(X1.7 | Our) X Il TI plo... | X) p(X, | Xv 1)p(X,). (5.1) 
t=1 Lc=1 


data likelihood 


The mode of p(X ,.7 | 04.7) is referred to as the maximum a posterior scene 
configuration. In consequence the MAP objective is formally given as 


Eir = arg max p(X,. | 01.7) (5.2) 
Xır 


Tre 
= arg max log (II | ploc | X) 
t=1 Le=1 


Xi — 


P(X, | er) > 
(5.3) 


where in (5.3) we substituted the logarithm of (5.1). Applying the product 
rule of logarithms (5.4) and inserting the data likelihood and the dynam- 
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ics model (5.5), we can recast the objective as a non-linear least squares 
optimization problem 


T 


£, TAE — Y loe(o Oc.t | X,)) + y log (p(X, | X4_1)) 


Kur t=1 c=1 t=1 


+ y log(p(X,)) (5.4) 


T 
= arg min D loe — CAR) Poll + XC £; a 
t=2 


Kur 421624 
T T 


En = Ber + 5 pee (5.5) 
t=1 


regularization 


where the terms Fox, Egist in (5.5) correspond to the box prior and the 
distance prior defined in (4.20) and (4.22) respectively. Since the likelihood 
and prior terms defined in (4.13) and (4.20, 4.22) only depend on the scene 
configuration X, at a single time step, they are aggregated as a sum over 
each time step t. 

From an optimization perspective, the prior terms act as regularization 
terms, restricting the set of possible scene configurations. The term Epox 
penalizes person locations X outside of the observable area. Continu- 
ous numerical optimization methods require differentiable energy terms. 
Therefore, we approximate the box penalty by a function which is zero for 
person locations in the observable area and increases with the distance to 
the border of the observable area. Let X7 be v-th component of X; ,, then 
the box penalty is given as 


EM E 32] [max (an = Xi E 0) + max(X; = SENTA 0) > (5.6) 
i=1 ve{1,2} 


where X min» {mar denote the borders of the observable area. 
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The distance energy term is a direct result of the prior proposed in 
(4.22), with an additional max function, which assigns costs to all pairwise 
distances smaller than x = Im: 


m-—l1 m 2 
i PS 1 
pes 5 b» [mas CET 2 o) : (5.7) 


We introduce the weighting parameters Atemporal» Abox» Adist € R* to 
balance the impact of the individual energy terms. The final non-linear 
least-squares objective for estimating the MAP path of scene configurations 
X1.r7 given a sequence of observations 01.7 (cf. posterior distribution 


(4.17)) is given as: 
T T 
Y : 2 2 
Xir = arg min X` 5 llo. E GX, P.) + MS ¡EN a X, all 
*ur 421 c= t=2 
data likelihood temporal dynamics 


T T 
b di 
E Abox 5 E, = RE Adist 5 E, = : 
t=1 t=1 


a priori assumptions 


(5.8) 


The temporal dynamics term connects a scene configuration X, with the 
preceding scene configuration X, ,, enabling joint optimization of the 
scene configurations X.7. The dynamics term can be interpreted as a 
temporal smoothing term, ensuring temporally consistent scene configura- 
tions. In fact, qualitative experiments (Sect. 6.6.2) show that this smoothing 
property can effectively prevent the optimization from getting stuck ina 
local minimum. 
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For single frame inference the temporal term can be dropped and the 
objective for a scene configuration at a single time step trivially follows 
from the full objective (5.8): 


$ = arg max p(X; | 0;) 
X, 
C (5.9) 
= arg min > > llo. =, ETT Po! Hr Ago E?" RE Agi E? P : 


x c=1 


5.1.2 Optimization 


Estimating the MAP sequence of scene configurations is obtained by nu- 
merically solving the non-linear least squares (NLLSQ) objective (5.8), 
using a continuous optimization method. In this work we use the iterative 
NNLSQ solver dogleg [76, 77]. Dogleg is a trust-region method, combining 
the Gauss-Newton algorithm with gradient descent. Notice that our ob- 
jective does not depend on a specific solver, thus one could also use other 
NLLSQ solving methods, such as the well-known Levenberg-Marquardt 
algorithm. For a comprehensive discussion on NLLSQ solving methods 
we refer to [71, p. 254 ff.]. 

Of-the-shelf gradient based NLLSQ methods require that each part of 
the objective has to be differentiable. Regarding the objective (5.8) this 
is the case for the regularization terms. The crucial part of (5.8) is the 
data likelihood term ||o. — G.(X,, Pol (cf. residual images in Fig. 5.2). 
Calculating the gradients for every output pixel with respect to X, requires 
differentiating the generative scene model function G.(X,,P.), which 
maps a scene configuration X, to its corresponding synthetic depth image 
(cf. Sect. 4.2.1). More precisely this involves the calculation of the partial 
derivatives for every output pixel G, ;(X,, P.) with respect to each scene 


configuration variable X, = (X4,...,X,,)! =(%11,%12,--- Em 1 Emo) € 
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(e) Iteration 16 


Figure 5.2 Residual images showing the optimization process for a single time step. Each 
column corresponds to one sensor c and each row to an optimization iteration. The images 


show the normalized per-pixel error ||o. — G.(X,, P.)|3, where black pixels correspond to 
the minimal and white pixels to the maximal error. 
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R?". Formally the desired partial derivatives are given as the Jacobian 


matrix 


a a vg 
G G 
Jot = "a3 a as = (5.10) 
* OX, — O(Xi1,315,... Im.) VG 
c,R 


with R being the number of depth pixels and the row vector V"G.. ; being 
the gradient of the depth value of the j-th pixel, computed by the render- 
ing function. However, by virtue of discontinuities typically occurring 
at the boundary of objects due to occlusion, the rendering pipeline is in 
general not differentiable. To overcome this challenge we employ the 
approximative differentiable renderer framework OpenDR [68]. OpenDR 
approximates the partial derivatives of the rendering pipeline with respect 
to some input variables of the rendering process. By approximating the 
Jacobian matrices J., with OpenDR and using automatic differentiation 
(cf. [44]) for the other terms, one can use of-the-shelf NLLSQ solver im- 
plementations to optimize the objective defined in (5.8). In Fig. 5.2 the 
gradient based optimization progress for a single multi-view frame in 
terms of residual images is depicted. 

A drawback of the proposed gradient based approach is the dependence 
of the optimization result on the initialization. If the initial scene con- 
figuration corresponds to generated depth images where the individual 
renderings do not have any overlap with the observed image data, the 
gradients for those individuals will be zero. In consequence the optimiza- 
tion potentially gets stuck in a local minimum around the initialization 
state. Fig. 5.1(a) depicts an example where the initial 3D person renderings 
share some overlap with the observations. We soften the dependence on a 
particular initialization by applying a typical coarse to fine strategy. For 
an improved convergence behavior, the data error term is calculated for 
six layers of a Gaussian image pyramid. 
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While the coarse to fine strategy improves the convergence behavior, the 
proposed MAP approach still depends on a good initialization. Moreover, 
for inference prior knowledge about the number of people present in the 
scene is assumed (cf. Sect. 4.3). Both arguments emphasize that the MAP 
approach is suited for the refinement of a pre-existing scene configuration 
initialization, rather than providing people detection from scratch. In the 
evaluation (Sect. 6.6) it is shown that the proposed MAP method can be 
effectively used to fine-tune discrete detections obtained by mean-field 
variational inference (Sect. 5.2). 


5.2 Mean-Field Variational Inference in Discrete 
Latent Space 


In this section we introduce the main contribution of the present thesis. 
We propose approximate inference methods for the discrete scene configu- 
ration space introduced in Sect. 4.4. Based on the mean-field variational 
inference (MF-VI) method described in Sect. 3.1, we deduce the mean- 
field update equations to effectively approximate the desired probability 
distributions defined in Sect. 4.4. For better comprehensibility we first 
omit the temporal context and start with applying mean-field variational 
inference to the posterior for one time step (4.26) in Sect. 5.2.1. Building 
on these foundations, we extend the mean-field update equations in order 
to approximate the Bayesian filtering distribution (4.28). In Sect. 5.2.3, we 
finally deduce the mean-field update regulations in order to effectively 
approximate the full joint posterior distribution (4.27) of people present in 
the scene across space and time. 


5.2.1 Data Posterior Distribution 


In this section we propose a probabilistic inference method for the dis- 
crete posterior distribution p(x, | o+) (4.26) at a single time step t. Inspired 
by POM [38] (cf. Sect. 3.2), we employ mean-field variational inference 
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(cf. Sect. 3.1.3) to effectively approximate the desired discrete distribution 
p(x: | 0+) (4.26). First, we derive the corresponding mean-field update 
equations in detail. In Sect. 5.2.1.1 we introduce an approximation for the 
deduced update expectations, enabling an iterative mean-field optimiza- 
tion algorithm for real-world applications. Note that this section is an 
extension of work previously published [108, Ch. 3] by the author. 

Since we only consider the observations at one time step in this section, 
we omit the time index t for ease of notation and restate the distribution 
(4.26) of a scene configuration x given the observations o as 


TI ploe |x) Ia pa) 
c I ey (5.11) 
3 x €{0,1}" Wes p(o. | x’) i- p(z) 


p(x |o) = 


Due to the high dimensional scene configuration space (0, 1)”, the parti- 
tion function (evidence) in the denominator of (5.11) is intractable, and 
we cannot directly compute the posterior distribution. Instead, we apply 
Kullback-Leibler variational inference (cf. Sect. 3.1.2) to approximate the 
inconvenient distribution p(x | o) by a simpler proxy distribution q(x). 
Following the reasoning given in Sect. 3.1.2, we propose to use the reverse 
KL-divergence. Thus the objective for optimizing q(x) can be expressed as 


q(x) = arg min KL(q(x) || p(x | 0)) 
j (5.12) 
- mean) — log p(x | 0)) ¿(x) - 


As elaborated in Sect. 3.1.3, a family of probability distributions (q-family) 
for the proxy distribution q(x) has to be defined. According to Sect. 3.1.3 


(3.19), we apply the naive mean-field assumption, which states that q(x) 
is a product over its marginal probabilities 


q(x) = II qi(z,) . (5.13) 
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Fig. 5.3(a) shows an exemplary mean-field state q(x). Note that the mean- 
field assumption does not imply that the grid states x,,...x, in our model 
are assumed to be independent of each other. It is only a statement of the 
structure of the proxy distribution, which effects the iterative optimization 
schema. In order to deduce the mean-field update regulations a probability 
distribution over all x except of one single element x, has to be defined 
(cf. Sect. 3.1). Let q(x \ ;) denote the mean-field distribution excluding 
the element z,, 


qx\2)= [[ ye). (5.14) 
j=1:j#i 


According to Sect. 3.1.3 (3.26), the general mean-field equations 


ail) x exp( (log (x 10), 5.) (5.15) 


update a single marginal distribution q;(r;) depending on the previous 
mean-field state q(x X z;). In Sect. 3.1.3 it is deduced that updating q;(z;) 
asynchronously according to (5.15) will decrease the KL divergence in 
(5.12) (see also Barber [11, 625 ff.]). Since each x; is Bernoulli distributed, 
(5.15) for x; being in state 1, can be written as 


1 
4; = 1) = Z, exp( (log p(o, x |z; = Diss) (5.16) 
with the partition function 


Zi = > exp ((log p(o, x |z; = 5) 2) : (5.17) 
se{0,1} 


Considering that the posterior is proportional to the joint distribution 
p(x | o) x p(o,x), we could substitute the posterior distribution p(x | o) 
in the inner expectation of (5.15) by the joint distribution p(o, x), while the 
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Figure 5.3 (a) Exemplary mean-field distribution q(x), where each grid cell corresponds to 
one marginal probability q,(w, = 1). (b) Mean-field update for one marginal distribution 
qi (x; = 1) (5.19) as a function of the expected value V;. 


normalization by Z; (5.17) ensures that (5.16) still satisfies the requirements 
of a probability mass function. 

Applying the equality > = T to (5.16) leads to the simplifica- 
tion 


q;(x; = 1) = i x exp (tonno. |x; = 0) e) 
(5.18) 


= 
ele Dis] 


Using the linearity of expectation and the quotient rule of the logarithm 
we can rearrange (5.18) to 


—1 


-=0 
qu (2, = 1) = |1+exp (e on) , (5.19) 
plo, x | zi = 1) / goes) 
V. 


i 


with Y; being the mean-field expectation related to a marginal distribution 
qu(z, = 1). 
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In Fig. 5.3(b) the mean-field update for one marginal distribution q;(x; = 
1) is plotted as a function of the expectation V;. Studying (5.19) and 
Fig. 5.3(b) already reveals the basic idea of an iterative mean-field update. 
The update of the probability of a person being present at location u; 
(hence x; = 1) increases as the expected value of the log ratio nuke 
decreases. Considering the expectation with respect to the current mean- 
field state q(x V z;), the numerator reflects the probability of the grid 
location u; being not occupied by a person, while the denominator reflects 
the probability of the grid location u; being occupied. The correlation 
between the binary states of a scene configuration is taken into account 
as the expectation with respect to the current mean-field state q(x V z;), 
which evolves over optimization iterations. Informally, (5.19) answers 
the following question: Considering the observations o and the current 
mean-field state, is the presence of a person at location u; more likely than 
the non-presence? 

Given this basic intuition for an update of a distribution q(x V z;), we 
further continue with formally deducing the final mean-field equations. 
Inserting the probabilistic model defined in (4.16) and (4.26), the uncondi- 
tioned expectation in (5.19) can be expressed as 


(log p(0,X)) ya) = (og plo | x)P()) qaa; 


-( Na lloc Bo) 
—— ÁÀ 


c=1 4 


9(0,,G,(x)) q(x \z;) 


C 
aa 3 (loe, Ge(X))) arar) + dog P(X)) an) > 


c=1 
(5.20) 
with the image similarity function 
1 2 
ôl, b) = y llh - Lll. (5.21) 
20 cbs 
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Conditioning (5.20) on x; = {0,1}, inserting into (5.19) and making use 
of the linearity of expectation allows us to restate the final asynchronous 
mean-field update as 


—1 


C 
c=1 
with the data expectation for one sensor given as 
Vo; = (6(0., Ge(x|z;—1)) — 6(0¢, Ge(x|m;—0))) qeu.) > (5.23) 


as well as the prior term 7;. Based on the independence assumption 
p(x) = [I4 p(x,) in (5.11) the prior term simplifies to 


Ti = (dos lA! 2) = log Igea (5.24) 
p(x | v; = 1) a(x\x;) p(x; = 1) 


Note that G.(x|x; = 1) maps a scene configuration x to a synthetic depth 
image in the perspective of sensor S, with x; forced to 1 (cf. Fig. 5.4). 

Fig. 5.4 depicts synthetic images and observations of three sensors, com- 
pared in expectation V, ; (5.23). For the calculation of Y. ; these synthetic 
images need to be generated and compared with the observations for 
every possible scene configuration state. The images in Fig. 5.4 correspond 
to an exemplary scene configuration x. Studying one row 5.4 (a-c) corre- 
sponding to sensor S,, the expectation (5.23) gets more accessible. For a 
given scene configuration x, weighted with the current mean-field state 
q(x X z;), the observation o, is compared with a synthetic image with x; 
forced to one (G, (x|x, = 1)) and then compared with x; forced to zero 
(G,(x|z; = 0)). The difference of both image comparisons reflects the 
likeliness of a person occupying cell u; under the scene configuration x 
and given the observation 0. 

Following the argument given in Fleuret et al. [38], one can also see 
how occlusion is handled implicitly: If the forward-model projection of 
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(b) Gi (x|x,; = 0) 


(d) G2 (x|v; = 1) (e) Go(x|x; = 0) 


(9) G3(x]x, = 1) (h) Ga (x|x; = 0) (i) o3 


Figure 5.4 Illustration of synthetic images and observations compared in the expectation 
V. ; for one scene configuration x. Each row corresponds to one sensor S ¿. The red dashed 
rectangle illustrates the bounding box corresponding to the rendering of a person present at 
location u;. The scene configuration x shown here is zero for every grid location except for 
two neighbors of u;. 


a person located at u; is occluded by a projection of a person with a high 
probability of occupancy, the value of x; does not affect the image distance 
d(0,, G.(x|v; = s)). Thus, the expectation V, ; in (5.23) converges to zero 
and the corresponding marginal distribution q; equals the prior. 


5.2.1.1 Approximate Mean-Field Update 


In spite of the mean-field assumption, (5.22) is still intractable due to 
the expectation V, = (-) s.) (5.23), which requires to calculate a sum 
over all possible 2””" scene configuration states x V x; € {0,1}"~". We 
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approximate the expected value V, ; considering only the relevant subset 
of scene configurations. Therefore, we exploit the fact that the difference 


9(0., G.(x]x; = 1)) = 5(0¢, G.(x|z; = 0)) (5.25) 


only depends on the pixels belonging to the silhouette of the projection 
of the 3D model at location u; (cf. Fig. 5.4). For a simpler and faster 
implementation, we approximate the region of belonging pixels by the 
corresponding axis-aligned rectangular bounding boxes, given as I,[u;] 
(red rectangle in Fig. 5.4). Only scene configurations, for which the pixel 
values inside the bounding box I,[u;] of the generated image G.(x) are 
affected, need to be evaluated for the expectation Y, ; in (5.22). We assume 
that only the projections of the direct eight neighbors of a grid location 
u; intersect with the bounding box I.[u,] (cf. Fig. 5.5). For our top-view 
setup this is a valid assumption. However, for a frontal view setup, a 
more sophisticated approximation would be preferable. Consequently, we 
can approximate the expectation Y. ; by the reduced neighborhood scene 
configuration X; € (0, 1)". Since the local neighborhood (including z;) 
allows only 2? — 512 possible local scene configurations, the expectation 
can be approximated efficiently. 

Instead of the image distance 6(-,-) derived from the data likelihood in 
Sect. 4.22, we introduce a weighted asymmetric image similarity 5,,,, (o, 9) 
between a foreground segmented observation o and a generated image g. 
Since there is no need to compute the derivative of the distance function in 
the mean-field optimization, we replace the squared L2-norm by the more 
robust L1-norm. Let M : RY** + (0, 1)" *" be a threshold function 
which maps an image to its binary foreground mask, M(i) = 1— M(i) 
its inverse and © the Hadarmard product between two images. The 


asymmetric image similarity is given as 


dasym(o, 9) =a ||0 © M(g)||, + (2— a) lg o M(o)]], 


(5.26) 
+ lllo— 9) © M(o) o M(g)ll, ; 
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(d) (e) G2(&;) (f) 02 


Figure 5.5 Illustration of direct neighborhood of a grid cell u; for sensor 1 (a-c) and sensor 
2 (d-f). (a, d) show the direct grid neighbors in turquoise and the grid location u; in yellow, 
respectively. (b,e) show the synthetic images where every element in the neighborhood scene 
configuration X; is set to one. (c,f) show the corresponding depth observations. The red 
dashed rectangle shows the bounding box 1. [u;] corresponding to the rendering of a person 
present at location u, in sensor $ ,. 


with the design parameter o € [0, 2]. 

In Fig. 5.6 the asymmetric image similarity is graphically illustrated, 
including the masks of three disjoint cases oO M (g), go M (o) and M(o) O 
M (g), respectively. For o = 1 the image similarity 5,4 (o, g) is identical 
to the L1-norm ||o — g||,. For a > 1 observed depth pixels which are not 
explained by the generative scene model o © M (g) will be penalized more 
strongly (cf. blue mask in Fig. 5.6(d)). Let further 

1 z 
ò = <= Îasym (0e [u;], G.(X;|v; = s)[u;]) (5.27) 


T—s 
20 obs 
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0.2 04 06 08 10 12 14 16 
Ll error / m 


(c) lo — gll, (d) 


Figure 5.6 Illustration of different error terms occurring in the asymmetric image similarity. 
(a,b) show an exemplary observation and synthetic depth image used as input for the 
image similarity asym (0, 9) (5.26). (c) depicts the L1-norm residual image, where purple 
corresponds to a low error and yellow to a high error, respectively. (d) shows the masks used 
in (5.26) as colored overlay, whereas blue corresponds to o © M(g), green to g © M(o) and 
purple to M(o) © M(g), respectively. 
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be the image similarity restricted to the cropped image region /..[u;]. Then 
the approximated expectation can be written as 


Vo, = . (ðs =1 = $20) qi.) . (5.28) 


Additionally, we normalize the expectation with respect to the size 
(number of pixels) of the image slice |I, [u;]|, to account for the viewpoint 
dependent size of a bounding box. In order to efficiently compute (5.28), 
we propose to pre-build for each u; a visual dictionary? of image slices 
I.[u;] for all 512 possible local scene configurations X;. 


5.2.1.2 Optimization Details 


According to Sect. 3.1.4 the final mean-field updates can be executed 
asynchronously or synchronously. In an asynchronous mean-field update 
iteration, the individual q;(x;)’s are updated sequentially, whereas, in a 
synchronous update iteration, all the q;(x;) are updated simultaneously, 
using the same previous mean-field state q(x \ x,). To avoid oscillating 
effects during the optimization we use the asynchronous coordinate-ascent 
variational inference (CAVI) method as listed in Algorithm 1 in Sect. 3.1.4. 
Hence, the probability for each q,(x;) will be updated asynchronously 
with respect to the previous mean-field state q(x \ z;) according to the 
final update equation 


—1 


(5.29) 


C 
1+ exp (^ + 5 2 
c=1 


An exemplary CAVI mean-field optimization with four iterations is de- 


picted in Fig. 5.7. 


2 In this context the visual dictionary refers to a pre-computed list of synthetic depth image 
slices, enabling efficient computation of the expectation V, ;. 
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(b) 


Figure 5.7 Mean-field results for one exemplary multi-view frame. (a) shows mean-field 
state q(x), given as a probability map of the marginals, at every iteration. (b) illustrates the 
final mean-field optimization results G(x), back projected into each sensor view as probability 
maps. 


5.2.2 Bayesian Filtering 


In this section we extend the mean-field approach introduced in the previ- 
ous section by taking the history of consecutive observations into account. 
As introduced in the previous section we use the mean-field assumption 
to approximate the filtering distribution. However, instead of just approxi- 
mating the data posterior p(x; | o,) at a single time step, in this section we 
approximate the Bayesian filtering distribution p(x, | 01:4) (4.28), incorpo- 
rating the previous observations. Since the desired distribution is defined 
recursively (cf. (4.11)), the past state is condensed in the previous filtering 
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distribution p(x,. ; | 01,4. .1). In practice, we use the final result from the 
last mean-field optimization ¿(x,_1) as an approximation of the previous 
filtering distribution. In consequence, each mean-field update directly 
depends on the approximated marginal distributions ¿(x,_,) from the 
previous time step. Following Sect. 5.2.1 the objective is to minimize the 
KL-divergence between the desired distribution p(x, | 01.4) and a proxy 
distribution q(x;¿), 


q(x,) = are min Ste) | p(X, | O1:2))- (5.30) 


Following Sect. 5.2.1 we apply the naive mean-field assumption (5.13), 
assuming the factorization q(x;) = [Ij ai(;) - 

Using the general mean-field equation and applying the transforma- 
tions presented in (5.15)-(5.19), the mean-field equations for the filtering 


distribution are given as 


—1 


q(x; = 1) = |1+ exp (oe p(01., xi | £i = 2) 
p(o14, Xi | Ti = 1) a(xı\r;) 


filter 
Y i 


(5.31) 


Inserting the probabilistic model from (4.28), using the linearity of expec- 
tation and the equality log(%) = log £ + log b : a,b,c,d > 0 allows the 
separation of the filtering expectation we into two disjoint parts; the 


already known data expectation and a predictive expectation 


pu = ye (os p(xi | 01:01, Ti = 2) . (5.32) 
p(x; | 014 1,2; = 1) a(x4\2;) 
—— Á 
pred 
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The data term for the current time step is the same as defined in (5.22), 
thus given as 


C 
gata — S bo.., G¿(x]x,=1)) — 9(o,, Ge(x|z;—0)) qu ,,,. (5-33) 


c=1 


The dependency on observations from previous time steps 0,.,. , is con- 
densed in the recursively defined predictive distribution p(x, | 01.,—1). 
After inserting the predictive distribution (4.29) and applying the pro- 
posed factorization of the dynamics model (4.30), the expectation red 
expands to 


pred _ is P(x: |Xı-1,2: = 0)p(xi-i| 01:11) 
V; =(lo 
NN p(x, | xy 1,2; = 1)p(xi a | 01:11) Ber 


tos Esa Hen, Pena | easi = O) Pa | 01:41) 
= 08 d 
PP jen! p(zj, | Xp1, Ti = px | 014-1) qa(x A z;) 
(5.34) 


with 57 being the shorthand for >” n and N; = N; Ui the 
Xi x+—1€{0,1} 
index set of the direct neighborhood of the cell u;, including the index 
i. As a direct consequence of the chosen transition function p(x, | x;_1) 
(4.40), which assumes that a person moves only in the direct neighborhood 
in one time step, the set of scene configurations can be reduced to the local 
neighborhood scene configurations x; of the cell u;. Therefore, the sum 
in (5.34) can be replaced by 25... 610,1) 
of the predictive expectation feasible. The filtering distribution of the 


», which makes the computation 
last time p(x¿_1 | 014.1) states the likeliness of a previous state x;_,. In 


practice this distribution is approximated by the result of the last mean- 
field optimization ¿(x;¿_ ;). Under this assumption and after applying the 
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definition of expectation to the nominator and denominator of (5.34), the 
approximated predictive expectation can be written as 


(Ten: p(z;, | XQ, 4,0; = 0)) 
Tex P(x 5. | X¿-1,% = 1)) 


G(X4_1) 


re — (os 
a1) eG) 


(5.35) 


Informally, the influence of (5.34) on a mean-field update (5.31) can be 
interpreted as follows: considering the distribution of the previous state 
x¿_1, the proposed dynamics model p(x; | x;_,), and the current mean- 
field state q(x, V z;), is the presence of a person at location u; more likely 
than the non-presence? 

The final mean-field optimization for the Bayesian filtering distribu- 
tion is implemented analogous to the data distribution elaborated in 
Sect. 5.2.1.2. 


5.2.3 Temporal Smoothing 


Building upon the ideas introduced in the previous sections we address 
the approximation of the full posterior distribution p(x;.7 | 01,7) in this 
section. In contrast to Sect. 5.2.1 and Sect. 5.2.2, where the marginal prob- 
abilities for one single time step t are approximated, we now aim for a 
joint approximation of the distribution of people present in the scene for a 
sequence of time steps. As a consequence, we extend the naive mean-field 
assumption in (5.13) to a factorization over a sequence of scene configu- 
rations x].r and deduce the corresponding mean-field update equations. 
Notice that the following section is an extension of previously published 
work [109, Ch. 3.B] by the author. 


87 


5 Inference 


q(x \ Tit) 


Figure 5.8 Illustration of the fully factorized temporal mean-field distribution q(x1;7 Vz; «). 
The temporal mean-field distribution can be separated into T spatial mean-fields q(x;,). For 
an update of a single marginal distribution alzi) the expectation is taken with respect to 
all other spatio-temporal nodes. 


Following the previous sections, we use mean-field variational infer- 
ence to approximate the distribution p(x}.r | 01:r) by a simpler proxy 
distribution q(x,.7). The optimization objective is given as 


q(X1.7) = arg min KL(q(x}.7) || p(xi.7 | 01:r)) (5.36) 
q 

= arg min(log q(X1r) — log perl rra.) (637) 
q 


In order to perform joint inference for a sequence of scene configurations 
X 1-7 a structure for the proxy distribution ¢(x,.7) enabling a computation- 
ally feasible approximation needs to be defined. We extend the mean-field 
assumption (5.13) by additionally factorizing over time 


n T 
q(Xi.7) = II II qi (Tit), (5.38) 


i=1t=1 


where each q; ,(z;,) denotes the marginal probability distribution of a 
person present at location u; at time step t (cf. Fig. 5.8). Analogous to (5.14) 
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let 


n T 
q(Xir \ Tit) = Il Il qj, kj) (5.39) 


be the mean-field distribution without the element z; , (cf. Fig. 5.8). 
According to the general mean-field equation (cf. Sect. 3.1) the optimal 
update with respect to the objective (5.37) is given as 


4.2.) x exp( (log P(r 017) aus.) (5.40) 


Re-arranging as in (5.15)-(5.19), the final update for x; , being in state one 
is given as 


quema =1) = [14 exp (a )] ^. (5.41) 


Inserting the probabilistic model defined in (4.27) and using the relation 
log(%) = log“ + log : a,b,c,d > 0, the expectation won in (5.41) 
expands to 


smooth = ( log P(Our, Xi | Vit = 0) (5.42) 
| p(01:7,Xı:r | a 1) q(xir Ni +) 
T = 
= (olli pna =0) (543) 
II plor | Xk Tis = 1) 


- log Hn p(x; | Xk1 Tit = 0) ) 
alar Wi.) 


Iss p(xi, | Inu) 


Using the linearity of expectation, (5.43) can be expressed as the sum of a 
data and a temporal expectation 


th di 
ouem we L gep., (5.44) 
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Studying the expression in (5.43) reveals that all terms in (5.43) indepen- 
dent of x; , cancel out, since the nominator and denominator for the factors 
independent of x; , are identical. Hence, the data term can be isolated to 


T = 
data = (os Hi: p(ox | xx ie = 2) (5.45) 
IIa p(or | Xk Tig = 1) a(X1.7\%:,2) 
— (1g Pc | ota = 9) (5.46) 
e 


In consequence the data term does not depend on the temporal context 
anymore and is identical to the data term derived in (5.22). According to 
(5.43) and (5.44), the temporal expectation is given as 


T: 
d o— 1» #i =0 
VIT = c Ik=ı PC | ee ) ) i (5.47) 
q(xi 7 Wei) 


Is p(xi, | Xk-1 Fir = 1) 


Following the same argument as for the data term, all factors which are 
independent of the state x; , cancel out. Depending on k, the forced state 
x; can be either part of the condition or the argument of the distribution 
P(Xx | Xy 1, £i = s). Analyzing the temporal expectation un P reveals 
that the factors in the product ]];_, p(xy | Xy 4,2; = s) with s € (0,1) 
can be separated into three disjoint cases, depending on the value of k: 


1) If k — t, the corresponding factor is given as p(x: | XQ 4, Tit = s) ; 
Since x; , is an element of the scene configuration x, , this distribu- 
tions depend on the state of x; ,. 


2) If k = t +1, the corresponding factor is given as p(X;,1 | Xs £i) - 
Here the condition x, depends directly on the state of x; , . 


3) For all other values of k with k A t ^A k Æ t + 1, the distribution 
p(xx | xx 1, 2,1) is independent of x; , . In consequence those fac- 
tors are identical for x; , = 0 and x; , = 1 and cancel out in (5.47). 
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Considering these findings and applying the linearity of expectation, we 


can separate the temporal expectation Ur P (5.47) into two parts, referred 


to as past and future expectation, 


temp _ p(x: | XQ) Tit = 0) (X41 | Xt, Lig = 0) 
Vie = (log 
' p(x: | XQ 4, Tit = 1)p(Xt44 | Xp, Tit = 1) CN M 
Sige p(x; | Xt) Tit = 0) (5.48) 
p(x | xi 2m 1) q(xiNzi oxi) 


fe (os P(Xt41 | Xt, ta = 0) : (5.49) 


P(X | Xp, Lit = 1) E 
ee 


future 


Wit 


t . 
where Y?" corresponds to the past expectation and Y° corresponds to 


the future expectation. Based on this intermediate result, both expectations 
can be further simplified. Inserting the dynamics model (4.30) in (5.48) 
and considering that all terms which are independent of x; , cancel out, 
the past expectation is given as 


yeast = (os p(x, | X15 Fit = 0) ) (5.50) 
ati) x4-1) 


p(x: | Xi1 Vit = 1) 


ES Lew; (vj, | X, 4,2;, = 0) 


= lee (5.51) 
jen: P(x;4 | Xt—1: Tit I3 1) ies 


= joa rl =1|%.-) (5.52) 
plz: =1 | xy) a(x 1) 


Finally, the past expectation only depends on the previous state x, 4, 


thus the expectation is only taken with respect to the distribution q(x,. 4). 
Following the derivation of the past expectation, inserting the dynamics 
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model (4.30) in (5.49) and canceling all terms independent of x; +, the 
future expectation simplifies to 


va = (os P(Xt44 | Xp, Tit = 0) ) (5.53) 
a(x AD; 65x61) 


P(Xt44 | Xt, Lit = 1) 


A (os Ijen; tiens | xe tit =0) (5.54) 


len p(z;ca | Kt Vit = 1) M 


E Y log” pun | Xe zi, = 0) ) , ^ (555) 


, DU 5 t+1 | Xi Tit = 1) 


a(x Ai xa) 


and depends on the current state x, and the state x,,, at the next time 
step. 

Since the dynamics model (4.40) introduced in Sect. 4.4.1 operates only 
on a local neighborhood, we only need to consider the reduced neigh- 
borhood scene configurations X;, € (0, Ee making the estimation of 
the expectations (5.50) and (5.53) computationally feasible. Under the re- 
duced scene state distributions, the final expectations used in the temporal 
mean-field optimization are given as the past expectation 


1- TX 
pres = ( log P(E | Xi 4-1) (5.56) 
y P(xit zi | Xit) a(&,. 1) 


and future expectation 


future __ CO | Xit Tit = 0) 
Ve = Y; log m . (557) 


‚Player | Xue -— 
q(X; X544) 
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Inserting into (5.41), the final mean-field equations for the temporal smooth- 
ing approach are given as 


—L 


Gl = 1) = [1 +exp | WS" + y uite . (5.58) 
p emp 


i,t 


5.2.3.1 Optimization Details 


As in the previous sections, for the mean-field optimization we use CAVI 
(cf. Sect. 3.1.4). Thus all q; ,(:) are updated sequentially according to (5.58) 
with respect to the previous mean-field state q(x;.7 V 2;,). In Algorithm 2 
the joint temporal mean-field optimization algorithm is listed. In each 
iteration the time slices q(x,) are consecutively updated from 1,...,T. 
This implies that the temporal context does have a direct impact on the 
estimation of the data term on the next iteration, since the mean-field 
distribution after one iteration is effected jointly by all temporal frames. 
Since the future term (5.53) relies on the mean-field state from the next 
time step we disable the future term in the first iteration. Considering 
that people can enter the observable area, we initialize all border grid cells 
with a probability of 0.5. To weight the temporal terms we extend (5.44) to 
Viro = WI + Apat VP + Afuture Vor With Apasts Aure € [0, 1]. 


5.3 CNN Inference 


In the previous sections, inference is obtained by iterative optimization 
methods, explicitly derived from the probabilistic model introduced in 
Sect. 4. In this section a different method is applied for inference in discrete 
latent space. We propose an end-to-end multi-view convolutional neural 
network (CNN) architecture to approximate the marginal probabilities of 
people present in the scene for a single time step. In contrast to the explicit 
iterative mean-field optimization, the CNN framework is demanding at 
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Algorithm 2 Joint Temporal Mean-Field Optimization 
1: procedure OPTIMIZETEMPORALMF 


2: G(X1.7) + init() > init mean-field 
3: forall m € {0,..., ITERATIONS} do 
4: for allt € {1,...,7} do > iterate along time axis 
5 for alli € (1,...,n] do > iterate over all grid cells 
-1 

: & c [1+ exp (Oa, os an) 
8: Q(x1.r) — update(d; +) > asynchronous MF update 
9: end for 

10: end for 

11: end for 


12: end procedure 


training time, but once the network is trained, inference can be obtained 
by a single deterministic forward pass. For a fair comparison with the 
proposed mean-field methods (Sect. 5.2) and to overcome the lack of a 
domain-specific large scale data-set, we sample from the generative scene 
model introduced in Sect. 4.2.1 to generate synthetic training data. To 
narrow the gap between synthetic training images and real-world depth 
sensor observations, we extend the generative scene model to generate 
randomized synthetic training images (Sect. 5.3.2). In contrast to classical 
data-driven approaches the proposed multi-view CNN is only trained 
with synthetic depth images and does not rely on any real-world training 
data. Analogous to the mean-field inference for one time step, the network 
takes three foreground segmented depth images as input and predicts the 
marginal probability distributions of people present in the scene (cf. Fig. 
5.9). Since the generative mean-field inference and the discriminative 
multi-view CNN framework only rely on the generative scene model, 
without further supervision, the methods can be directly compared at 
inference time. This section is an extension of previously published work 
[112, Ch.2] by the author. 
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Table 5.1 Parameters of CNN feature extraction. 


CNN block Layertype Filters Kernel Size 


1 Conv (1,*) 32 5x5 
2 Conv (2,*) 64 3x3 
3 Conv (3,*) 128 3x3 
4 Conv (4,*) 256 3x3 
5 Conv (5,*) 512 3x3 
1-5 Max Pool $ 2x2 


5.3.1 End-to-End Multi-View CNN Architecture 


In this section we introduce a multi-view CNN architecture for people 
detection in multiple depth images. For a fair comparison with the proba- 
bilistic methods, the same discrete grid model as proposed in Sect. 4.4 is 
used. Since we only consider the observations at one time step, we omit the 
time index t for ease of notation. Hence, a scene configuration is given as 
x= (z4,,...,24)! € [0,1)” and the vector of foreground-segmented depth 
observations at one time step is stated as o = (0,,...,0,)'. The objective 
of the proposed end-to-end approach is to approximate the distribution 


n 


p(x |o) = | [pe | 0) (5.59) 


i=l 


with p(x; | o) being the marginal probability of a person present at ground 
plane location u; given the observations o. To approximate (5.59) we 
propose a multi-view CNN architecture (cf. Fig. 5.9) which jointly exploits 
the depth observations from three sensors. 

We observe no significant drop in performance when the input depth 
images are scaled-down by a factor of 2. Therefore we use input depth 
images of size 188 x 120 for each individual CNN-head, yielding the 
advantage that the GPU memory footprint can be reduced significantly. 
Generalization over the visual features is achieved by weights sharing 
across the input CNN-heads. The resulting feature maps of each CNN- 
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View 1 View 2 View 3 
(120,188) (120,188) 


Shared weights 
CNN Head 1 
CNN Head 2 
CNN Head 3 


(7680,1) (7680,1) (7680,1) 


ReLU 


ReLU 


Sigmoid 
(180,1) | 


Predicted detection probabilities 
projected on ground floor 


Figure 5.9 Overview of our proposed CNN multi-view architecture. Each input depth 
image serves as input fora CNN module. The output of the last fully connected layer predicts 
the marginal probability distribution of people present in the scene. 
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in ee 


T 
Random scene 
configuration 


P1, P2, P3 


Sensor network calibration 


Figure 5.10 End-to-end multi-view training with randomized synthetic depth images. 


head are concatenated and fed into a multilayer perceptron (MLP) in order 
to learn correlations between the individual views. Each CNN-head is 
built of five blocks sharing the same structure. Overfitting is prevented by 
applying a dropout layer [90] after each block. The retention probability is 
set to penn = 0.25. The detailed parameters of the CNN layers are given 
in Tab. 5.1. The three resulting feature vectors are concatenated and fed 
into the first fully connected layer FC1. After the first two fully connected 
layers, dropout with retention probability ppc = 0.5 is used. The final fully 
connected layer FC3 is followed by a sigmoid function and predicts the 
desired marginal probabilities of people present in the scene (5.59). 


Training End-to-end training of the multi-view CNN is achieved by 


formulating the estimation of the desired marginal probabilities in (5.59) 
as a binary classification problem, thus using the binary cross-entropy loss 


be = -+J vi log Alas) + (1- y) log (1 (3). (860) 
i—l 


with y = (yi, ....u4)! € (0,1) being the ground truth scene configura- 
tion and f(z;) being the predicted probability of a person present at cell u,. 
Training samples are created by drawing a random scene configuration, 
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(a) (b) (c) 

(d) (e) (f) 
Figure 5.11 Synthetic depth images generated from the scene model for one specific scene 
configuration x in sensor view one. (a) shows the synthetic depth image based on the static 


person model as introduced in Sect. 4.2.1. (b)-(f) show four independently drawn samples 
from the proposed randomized person model for the scene configuration x. 


used as ground truth y and as input to generate randomized synthetic 
depth images in perspective of each sensor, which are used as input for 
the multi-view CNN. An overview of the end-to-end training process 
is depicted in Fig. 5.10. For training, back propagation and mini-batch 
gradient decent is used. As optimizer, we use Adam [56]. 


5.3.2 Randomized Synthetic Depth Image Generation 


The randomized generative scene model is an extension of the generative 
model proposed in Sect. 4.2.1. The basic model is built on a static rotation- 
ally symmetric 3D person model, consisting of a cylinder for the body and 
a sphere for the head, cf. Fig. 5.11(a). Synthetic depth image generation 
is obtained by placing 3D person models in the scene depending on the 
provided scene configuration x. Each 3D person model is rendered into 
the perspective of each sensor S, using the given extrinsic and intrinsic 
camera parameters. We extend the static person model by introducing a 
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parameterized 3D person model to express different shapes of persons 
in the scene. Randomization is achieved by treating the parameters of 
the person model as random variables. Each individual person model 
is defined by a set of vertices V = [vy,...,v,,) with v = (zy, z)' and 
a set of faces F where each face is given by a triple of vertices. The set 
of vertices Y is split into two disjoint subsets V = Vy; U Vspn, where the 
vertices V.,ı correspond to the cylinder (body) and the vertices Vspn to the 
sphere (head) of a person model. The separation of the two geometric 
primitives is used to apply transformations independently on the cylinder 
or the sphere of a person model. As world coordinate system we define the 
z-axis perpendicular to the ground plane (ry-plane with z = 0), represent- 
ing the height over ground. It is assumed that a person mesh is initially 
centered in the xy-plane with the foot point at z = 0. Diversity in pose and 
shape is provided by three principle degrees of freedom: (i) deforming the 
body of a person (circular cylinder) to an elliptic cylinder to get a variety 
of rotationally asymmetric shapes; (ii) rotating the person model around 
the z-axis to model the body orientation; (iii) resize the height of a person. 
These variants are expressed by the parameterized transformation 


s, 0 0 
588) = R(Y) 70 Sy O}-v, (5.61) 
0 0 s, 


which applies non-uniform scaling followed by a rotation around the 
z-axis with angle y on a single vertex. To generate a single instance we 
apply the transformation f(v; S7, Sy, S7, y) to all vertices in set V.y and 
F(v;1,1, 57,0) to all vertices in the set Vsph respectively. The parame- 
ters $7, Sy, S2, $7, y are considered to be uniformly distributed random 
variables (cf. Algorithm 3). To get more variations in shape we add inde- 
pendent Gaussian noise to the x, y, z-components of each vertex v € V. A 
detailed description of the sampling process and the assumed parameter 
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distributions are given in Algorithm 3. Fig. 5.11(b)-5.11(e) show exemplary 


sampled synthetic depth images for a scene configuration x. 


Algorithm 3 Randomized generation of synthetic depth images. 


1: procedure SAMPLEFROMGENERATIVEMODEL 


2: Veyi Vspn; F + init() > init with default model 
3 h ~ U(2,6) > drawn number of expected persons 
4 x ~ B(1/h) > draw scene configuration 
5 for all x, = 1 do > iterate over cells with a person 
6: Sz Sy ~U(0.5, 1.5) 
7 S7 ~ U(0.85, 1.15) 
8 S2 ~ U(0.85, 1.15) 
9: y ~ U(0,7) 
100 Woy e ((vi Se Ss Sti y)|v € Van) 
11: V'sph  {f(V¥5 1, 1,92, 0)|v € Von] 
12: tar ty ~U (0, 0.1) > draw position offset 
13: for all v € Vo U Va do 
14: Gs Sys Gz ~ N (0, 0.04) > draw AWGN 
15: V vr Ge 
16: v € v c (liz, liy 0)" > move to grid pos. l; 
17: v € v + (tz, ty 0)" > add position offset 
18: end for 
19: renderer.addMesh(V «yı U Vig F) 
20: end for 
21: return renderer.getDepthImages() 


22: end procedure 
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In this chapter we present quantitative and qualitative results for the infer- 
ence methods introduced in this thesis. Due to the lack of appropriate data 
sets for multi-view people detection in depth images, we first introduce a 
novel data set in Sect. 6.1. In order to quantitatively evaluate and compare 
the proposed probabilistic inference methods, we apply a threshold to the 
resulting marginal probabilities to obtain detections and report results in 
precision-recall space (Sect. 6.2). The experiments focus on the evaluation 
of the proposed mean-field variational inference methods (Sect. 6.4). We 
quantitatively compare the mean-field approach as a frame-by-frame de- 
tector (Sect. 6.4.1) with state-of-the-art monocular multi-view approaches 
(listed in Sect. 6.3). Sect. 6.4.2 examines the effect of the temporal context 
on the mean-field variational inference optimization. In order to put the 
results into perspective, we additionally compare with the proposed end- 
to-end multi-view CNN architecture in Sect. 6.5. Since the MAP inference 
method cannot be directly compared with the other methods due to its 
dependence on the initialization of the continuous NLLSQ optimization 
(cf. Sect.5.1), we present qualitative results in Sect. 6.6. Finally, we con- 
clude this chapter with an extensive discussion of the presented results in 
Sect. 6.7. 


6.1 Data Set 


To the best of our knowledge, no publicly available data set covers the 
scenario of top-view people detection using multiple depth sensors with 
overlapping fields of view. Therefore, we introduce a novel data set to 
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Figure 6.1 Overview of the MULTIPLE data set for top-view indoor people detection. The 
data set provides raw stereo image pairs as well as disparity maps obtained from three 
calibrated stereo vision sensors. The presence of individuals is annotated in ground plane 
coordinates. 


compare our approach with state-of-the-art multi camera people detection 
approaches. The data set contains footage from an indoor office scene 
and is recorded from three low-resolution commodity stereo-vision-based 
depth sensors, covering a variety of distributions of individuals in the 
scene (see Fig. 6.2). The stereo sensors are passive and use block matching 
to calculate the disparity maps. 

The sensors have a top-view on the scene, are mounted at a height of 
three meters, and have fields of view with a significant joint overlap (see 
Fig. 4.1). They cover a visible area of approximately 20 m” with up to six 
individual people present in the scene, entering and leaving the visible 
area multiple times. The data set consists of 2800 annotated multi-view 
frames’, captured with a resolution of 376 x 240 pixel each, providing 
raw rectified stereo image pairs as well as disparity maps. In total, we 
annotated the ground level plane locations of more than 12000 targets. 
Additionally, we associated each detection with a track to allow for full 
detection and tracking evaluation. The data set is publicly available and 


1 T a ; ; , 
A multi-view frame contains an image from each sensor at one time step. Depending on 


the context this is also referred to as a temporal frame. 
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referred to as MULTIPLE (Multi-View Intensity-Depth Data Set for Top-View 
Indoor People Detection)". 


6.2 Metrics 


In order to obtain a set of detections, we threshold the output map of 
marginal probabilities. Let p € [0,1] be the detection threshold. By ap- 
plying the threshold to the inference results in the form of marginal prob- 
abilities g(x), the set of detections for one particular value of p is given 
as the grid cell locations (u, | q;(z;) > p}. For the quantitative evalua- 
tion the resulting detections are matched to the ground truth data by a 
nearest-neighbor search. A detection is considered to be a true positive if 
it is within a radius of 30 cm of the ground truth (measured on the ground 
plane in 2D world coordinates). 

In contrast to typical object detection methods, the resulting detections 
are given in ground plane world coordinates, rather than bounding boxes 
in images coordinates. Therefore, classical object detection metrics such 
as the (Generalized) Intersection Over Union (IoU) [84] are not suitable for 
our setup. A typical metric for reporting detection results are the Receiver 
Operating Characteristic (ROC) curves. However, if the underlying classi- 
fication problem is imbalanced, the ROC metric does not reflect the true 
performance of a classification algorithm. As an alternative to the ROC 
metric, the precision-recall space provides a metric for evaluating the clas- 
sification performance on highly class imbalanced tasks (cf. [87]). For a 
comprehensive discussion on the relationship between ROC and precision- 
recall curves we refer to Davis et al. [29]. Considering the present detection 
problem as binary classification reveals that the problem is highly class 
imbalanced, since only a few cells are actually occupied by an individual. 
For the quantitative evaluation of the proposed methods we therefore use 
the precision-recall space. 


? Subsets of MULTIPLE are introduced in [109, 111]. The full data set is available at 


https:/ /www.h-ka.de/isrg/publications/multiple. 
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Figure 6.2 Exemplary samples from the MULTIPLE data set. The first row of each sample 
shows the rectified gray scale image with ground truth annotations on the ground plane, 
where each individual is marked with a unique color. The second row shows the raw depth 
images. 
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Let TP, FP, FN are the counts of the true positives, false positives and 
false negatives, respectively. The precision is then defined as the ratio 


TP 


precision = TP + FP? 


(6.1) 
where the denominator TP + FP is given as the total count of detections 
obtained by the evaluated method. Thus, in the context of object detection, 
the precision can be interpreted as the proportion of correct detections. 
However, the precision does not account for the false negatives, in this 
context the individuals present in the scene not detected. To express 
the trade-off between false positives and false negatives, the precision is 
considered with respect to the recall, given as the ratio 


TP 


recall = TP + EN . 


(6.2) 
In the context of people detection the denominator TP + EN can be identi- 
fied with the number of individuals present in the ground truth. Hence, 
the recall can also be identified as sensitivity or true positive rate. 

To report the performance of our methods we provide precision-recall 
curves (e. g. Fig. 6.3) as a function of the detection threshold p. To summa- 
rize the performance in precision-recall space, we additionally report the 
F1-Score as well as the area under the curoe (AUC). The F1-Score is defined 
as the harmonic mean of precision and recall: 


2 - precision - recall 
F = p 


(6.3) 


precision + recall ` 


To summarize the precision-recall performance in one figure, we report 


the best F1-Score achieved by a particular approach. 
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The AUC (also referred to as average precision) is defined as the (ap- 
proximative) integral of the precision-recall curve. In practice, the integral 
is approximated by the discrete sum over J € N threshold values 


J=1 


AUC = X (recall(p;) — recall(p;+1)) - precision(p;) , (6.4) 
j-0 


with precision(-) and recall(-) being a function of a particular threshold 
value p € (po,..., p J} uniformly covering the interval [0,1]. 


6.3 Approaches for Comparison 


As mentioned in Sect. 2, only a few approaches in literature rely on depth 
images for multi-view people detection in the top-view. To the best of 
our knowledge, the only method directly comparable to our set up is the 
work by Tseng et al. [96]. However, their approach hinges on high quality 
depth data and neither provides a publicly available implementation nor 
a data set. To put our results into perspective, we therefore compare the 
methods proposed in this thesis with state-of-the-art monocular multi- 
view approaches. As a baseline on the given depth observations we 
introduce a difference of Gaussian (DoG) based blob detector. In detail, 
the methods to be compared are: 


= DoG-Detector As a baseline on the given depth data, we apply 
difference of Gaussian blob detection on the foreground segmented 
depth images of each sensor independently and project the resulting 
detections onto the common ground plane. The final detections on 
the ground plane are obtained by proximity clustering. 


= POM [38] is methodically the most related approach to the proposed 
probabilistic inference methods. However, it operates on binary in- 
put observations only (cf. Sect. 3.2 for a comprehensive discussion 
of POM). In the original paper, the binary foreground segmentation 
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masks are obtained by monocular video cameras. In contrast, we 
use the same depth based foreground segmentation masks as in our 
approach for a fair comparison. Furthermore, the grid layout and 
the camera calibrations are identical to our setup for better compara- 
bility. For the experiments we use the original C++ implementation? 
provided by the POM authors. 


= Deep Occlusion [10] is a current state-of-the art end-to-end archi- 
tecture for multi-view person detection (a more detailed description 
is given in Sect. 2.2.1). As input, we stack the given gray scale ob- 
servations to a three channel image to be compatible with the RGB 
architecture. Due to the lack of a large data set, we use the available 
pre-trained model* without any further supervision. Considering 
that the model is pre-trained with RGB images, using stacked gray 
scale images for inference may have a negative impact on the pre- 


diction performance. 


6.4 Probabilistic Multi-View People Detection in 
Discrete Latent Space 


In this section we qualitatively and quantitatively evaluate the main con- 
tribution of the present thesis: The performance of the probabilistic peo- 
ple detection in discrete latent space by mean-field variational-inference 
(Sect. 5.2). Since we compare with methods for frame-by-frame detection, 
we first report comparative results of our mean-field method, omitting 
the temporal context (Sect. 6.4.1). In Sect. 6.4.2 we show the impact of 
additionally using the temporal context on the detection performance. 


7 https: / /www.epfl.ch/labs/cvlab/software/tracking-and-modelling-people/pom/ (ac- 


cessed 30.04.2021) 


E https: / / github.com/pierrebaque/DeepOcclusion (accessed 30.04.2021) 
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Table 6.1 Performance of the evaluated approaches (without temporal context). 


AUC Fi-Score Precision Recall 


Ours (MF-VI Data Term) 0.98 0.95 0.96 0.95 
POM [38] 0.93 0.93 0.91 0.95 
DoG-Detector 0.76 0.86 0.93 0.79 
Deep Occlusion [10] 0.58 0.70 0.80 0.62 


For all experiments based on the discrete latent space we employ a 
ground plane grid with 15 x 12 grid points, corresponding to a horizontal 
and vertical distance of 33 cm between adjacent grid points. 


6.4.1 People Detection for One Time Step 


The results in this section are based on the experiments previously pub- 
lished in [108]. We compare our mean-field variational inference approach 
applied to one multi-view frame, omitting the temporal context as intro- 
duced in Sect. 5.2.1. 

During our experiments we have noticed that our approach is quite sen- 
sitive to the initial marginal probabilities q  (*;). If the initial occupancy 
probability is too small, the expectation in (5.23) will inordinately favor 
scene configurations with only one person present. Thus, occlusion is not 
taken into account in the first iteration. We therefore initialize each mean- 
field node with a prior of qi as) = p(x;) = 0.5. The design parameter of 
the asymmetric image similarity asym (*, -) (5.26) is set to a = 1.25, to penal- 
ize unexplained observations. The standard deviation of the measurement 
noise Cops is set to a default value of 2 cm. 


6.4.1.1 Quantitative Results 


For the quantitative evaluation we use a subset of the MULTIPLE data set 
of 2200 consecutive multi-view frames”. Fig. 6.3 depicts the performance 


3 Only every 10th consecutive multi-view frame is used to calculate the precision-recall 
metric. 
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Figure 6.3 Precision-recall curves showing the performance (precision range [0.5, 1]) of the 
mean-field variational inference approach without temporal context, over all views. 


of the examined approaches over all views. The results show, that the data 
set MULTIPLE is very challenging for deep learning architectures, such as 
Deep Occlusion [10], without domain specific fine-tuning. There are two 
aspects to put the performance of Deep Occlusion into perspective: (i) due 
to the vertical top-view, the appearances of people are drastically different 
compared to the classical profile view; (ii) the original architecture was 
trained on RGB images rather than gray scale images. 

The results of the DoG-Detector indicate that, even when considering 
proximity clustered results from all three views, naive blob-based single- 
view detectors are not competitive compared to the more sophisticated 
multi-view approaches in our scenario. 

Although POM [38] operates on binary input images it achieves remark- 
able performance in our setting. However, our approach outperforms 
POM in terms of precision, resulting in a better area under the curve value 
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Precision-Recall 
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Figure 6.4  Precision-recall curves for different combinations of views. For a fair comparison 
only the people visible in all three views are taken into account. 


(AUC) as well as in a better F1-Score (see Table. 6.1). While POM is quite 
sensitive (every structure of significant size is detected as an individual 
person), our approach is more restrictive due to a more expressive forward 
model, leading to an increased precision. Our hypothesis is that this effect 
might be even stronger in more complex real-life scenarios, due to the 
following reasons: (i) foreground images might include additional objects, 
such as shopping carts or trolley bags; (ii) varying illumination conditions 
yield a higher level of measurement noise. In both cases, the foreground 
images get more cluttered, demanding a robust people detection method. 

In order to show how our probabilistic model exploits the multi-view 
evidence given by all three sensors, we evaluated the performance of our 
approach for all different combinations of sensor views contributing to the 
solution. For a fair comparison, we take only those people into account 
that are visible from all three sensors (see Fig. 6.2 for the fields of view 
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Figure 6.5  Precision-recall curves for different values of the asymmetric image similarity 
parameter a. 


of the sensors). Fig. 6.4 shows the increase in detection performance by 
leveraging the multi-view information. In the mono-view case, View 2 
and View 3 by themselves do not perform well with F1-Scores of 0.61 and 
0.73, respectively. However, combining the image evidence of View 2 and 
View 3 leads to a drastic performance increase, as evidenced by the best 
F1-Score of 0.92. Similar results are reported for the combination of View 1 
and View 2 with an F1-Score of 0.9 and View 1 and View 3 with an F1-Score 
of 0.92. Even though View 1 achieves comparably good performance due 
to the general viewpoint, using the image evidence from all three sensors 
clearly outperforms all other view combinations. These results clearly 
demonstrate the capability of the proposed method to jointly leverage the 
multi-view information in the overlapping image regions. 

Fig. 6.5 depicts the impact of the asymmetric image similarity parameter 
a € [1,2], introduced in (5.26), on the precision-recall performance. For 
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a = 1 the image similarity function is equal to the L1-norm, while for 
a > 1 the observed pixels not explained by the rendering of the generative 
scene model are penalized proportional to a. Consequently, increasing 
a leads to better recall values while potentially decreasing the precision. 
In this evaluation, a decrease in precision can be observed for a > 1.4. 
Comparing the performance of the symmetric image similarity (i.e. a = 1) 
with the performance of the asymmetric image similarity for a = 1.3,a 
significant increase in performance can be observed, evidenced by the best 
F1-Score of 0.91 compared to 0.95, respectively. 


6.4.1.2 Qualitative Results 


Fig. 6.6 and 6.8 show exemplary mean-field optimization results after five 
iterations. The final marginal probability map is projected onto the ground 
plane, where purple corresponds to a probability of zero and yellow to 
one respectively. 

Fig. 6.6 highlights some exemplary flawless results. The given sam- 
ples illustrate that our approach is able to resolve challenging scenarios, 
suffering from occlusion and measurement noise, by making use of the 
full multi-view image evidence. Fig. 6.6(b) gives a particular example 
for the ability to handle occlusion by exploiting the image evidence from 
multiple sensors. While the fifth individual is fully occluded on the first 
view (left image in Fig. 6.6(b)), the second view and partially also the third 
view include enough image evidence of the fifth individual to predict a 
probability of occupancy close to one. Fig. 6.6(c) shows that the resulting 
marginal distributions also include some uncertainty around the peaks of 
the distribution. This happens in general if the generative scene model 
cannot precisely explain the observations, e. g. due to discretization errors, 
measurement noise or the simplified 3D person model (cf. Sect. 4.2.1). 
However, for the majority of samples the resulting marginal distribution 
of people present in the scene contains clear peaks with a high confidence. 
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Figure 6.6 Exemplary mean-field optimization results g(x) (without temporal context). (a) 
and (b) show an estimation of the marginal probability distribution q(x) with clear peaks 
at grid locations occupied by a person. (c) includes some uncertainty near the peaks of the 
distribution. 
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Figure 6.7 Comparison of asynchronous and synchronous mean-field updates considering 
the best achieved F1-score in each mean-field optimization iteration. 


In Fig. 6.8 two mean-field optimization results, including some exem- 
plary faulty detections are shown. Fig. 6.8(a) shows a typical false negative 
case on the image border, which is the dominant error class occurring in 
the data set. Due to the stereo-vision sensors, the depth information is 
more noisy on the image border, eventually leading to an insufficient fit of 
the 3D model. To overcome this limitation in future work, a richer proba- 
bilistic sensor model which takes systematically varying noise into account 
could be employed. Two similar false negative cases can also be observed 
in Fig. 6.8(b) in the third view. Apart from that, Fig. 6.8(b) includes a false 
positive detection, indicated by a blue dashed circle. This type of false pos- 
itive potentially occurs between two individuals due to the discretization 
error and the simplified 3D person model. However, regarding the present 
data set, false positives are the minority errors occurring in the evaluation. 

In Fig. 6.9, the iterative mean-field optimization process is illustrated 
for one exemplary frame, for both the asynchronous and the synchronous 
update strategy (cf. Sect. 3.1.4). In Fig. 6.9(b) the asynchronous mean-field 
update optimization is illustrated. One can observe that after the first 
optimization iteration the probability mass is already quite concentrated 
around the grid locations occupied by a person. In iterations 2 and 3 the 
probability mass gets more concentrated, leading to a marginal distribu- 
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Figure 6.8 Exemplary mean-field optimization result g(x), including faulty detection re- 
sults. (a) includes a typical false negative, marked with a red dashed circle. (b) includes a 
false positive, marked with a blue dashed circle as well two false negatives. 


tion with clear peaks. In contrast, Fig. 6.9(c) depicts the optimization for 
synchronous mean-field updates. The example illustrates typical prob- 
lems of synchronous mean-field updates. In contrast to the asynchronous 
update, the probability mass is distributed on more grid cells after the first 
iteration. Also in the upcoming iterations the simultaneous optimization 
suffers from oscillating marginal probabilities, evidenced by the grid loca- 
tion u;[2, 6], shown as red curve in Fig. 6.9(c). Considering the overall best 
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(a) Final mean-field result for asynchronous mean-field update 
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(c) Synchronous mean-field update 


Figure 6.9 Evolution of asynchronous and synchronous mean-field updates. In the left- 
hand plots of (b) and (c), every path corresponds to the probability evolution of one q;(z;). 
The probability evolution of six grid locations of interest are plotted in unique colors, the 
others are plotted in purple. The right-hand plots show the same process illustrated as 
probability maps for the first four iterations. 
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Table 6.2 Performance of temporal mean-field approximations. 


AUC Fl-score Precision Recall 


MF-VI Temporal Smoothing 0.94 0.94 0.96 0.92 
MF-VI Bayesian Filtering 0.96 0.93 0.95 0.90 
MF-VI Data Term 0.95 0.91 0.92 0.90 
DoG-Detector 0.85 0.83 0.77 0.92 


F1-score per iteration, Fig. 6.7 shows that the asynchronous update strategy 
clearly outperforms the synchronous update strategy. While asynchronous 
mean-field optimization converges after only few iterations, synchronous 
mean-field update suffers from the oscillating effects mentioned above. 
In the remainder of this evaluation, the asynchronous mean-field update 
strategy is used. 


6.4.2 People Detection with Temporal Context 


The results in the previous section are based on typical frame-by-frame 
detection to be comparable to state-of-the art approaches. In contrast, 
in this section we will focus on the evaluation of the additional impact 
of exploiting the temporal context. Hence, we will evaluate the mean- 
field variational inference approximations for (i) the Bayesian filtering 
distribution p(x; | 04.7) as proposed in Sect. 5.2.2; (ii) the full posterior dis- 
tribution p(x,.7 | 01.7) (referred to as temporal smoothing) as introduced 
in Sect. 5.2.3. This section is based on the results previously published in 
[109]. 


6.4.2.1 Quantitative Results 


In Sect. 6.4.1 we report quite strong detection performance for the mean- 
field approach omitting the temporal context (MV-VI Data Term), evi- 
denced by the best F1-score of 0.95 and an AUC of 0.98. Since these results 
are nearly reaching the optimum, we evaluate the exploitation of the tem- 
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Figure 6.10 Precision-recall curves showing the performance of our approach with and 
without temporal context. 


poral context on a slightly more challenging? disjoint image sequence. The 
sequence consists of 300 consecutive multi-view frames, which are a sub 
sequence of the MULTIPLE data set introduced in Sect. 6.1. As a conse- 
quence of the more challenging sequence the performance of the MV-VI 
Data Term drops to best F1-score of 0.91 and an AUC of 0.95 (cf. Fig. 6.10). 


Subsequently, we analyze the effects of using the temporal context by 
comparison of the following methods: 


= MF-VI Temporal Smoothing refers to the approximation of the 
full posterior distribution p(x .7 | 01.7) as introduced in Sect. 5.2.3. 
During evaluation, we observed that the influence of the proposed 
future term in (5.53) on the quantitative results is negligible for the 
proposed update strategy and motion model. Therefore, we set 


Compared to the sequence used for the single time step case, the sequence contains slightly 
more individuals present at the image border, making people detection more challenging. 
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the weighting parameters to Apast = 0.65, Afuture = 0.0. In conse- 


as 
quence the presented results ar based only on the past and data 
term defined in (5.46), (5.50) respectively. Assuming only moderate 
movement of individuals between two consecutive frames, the pa- 
rameter Isa” in the dynamics model (4.40) is set to sef = 0.8. The 
normalization weights are set according to (4.42). For evaluation, 


we run six mean-field iterations. 


MF-VI Bayesian Filtering refers to the approximation of the filter- 
ing distribution p(x, | 04.7) as proposed in Sect. 5.2.2. The dynamics 
model parameter is set to Hsef = 0.2. As for the temporal smoothing 


method we perform six mean-field iterations for evaluation. 


MF-VI Data Term refers to the approximation of the posterior dis- 
tribution p(x, | o+) limited to the observations from a single time 
step and a uniform prior, as introduced in Sect. 5.2.1 and evaluated 
extensively in Sect. 6.4.1. 


DoG-Detector To put the results into perspective we compare them 
with the Difference of Gaussian blob detector as introduced in 
Sect. 6.3. 


Fig. 6.10 shows a comparison of the precision-recall performance for the 


methods mentioned above. Although the performance of the mean-field 


variational inference approach without temporal context (MF-VI Data 


Term) is already quite high (best F1-score of 0.91), the results show that 


the exploitation of the temporal context can increase the overall precision 


and recall performance, evidenced by a maximum Fl-score of 0.94 for the 


MF-VI Temporal Smoothing approach and 0.93 for the MF-VI Bayesian 


filtering approach, respectively. Considering that the temporal smoothing 


approach is a batch processing method, the full sequence of observations is 


optimized jointly. In contrast, the Bayesian filtering distribution is defined 


recursively, where the past is condensed in the mean-field approximation 


2 Reflecting that a person will stay on the current location with a probability of iar. 
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Figure 6.11 Dependence on hyperparameter a evaluated for MF-VI Data Term and MF-VI 
Temporal Smoothing. 


of the previous time step d(x;_,). In practice, this enables to use the 
Bayesian filtering approximation as a classical frame-by-frame detector 
in real-time applications. From this perspective, it is remarkable that the 
Bayesian filtering method does exhibit a similar performance on the given 
data set compared to the temporal smoothing approach. Still, the results 
must be interpreted carefully, since the current data set is limited and the 
mean-field approximation without usage of any temporal context already 
performs quite well. Our hypothesis is that in order to capture the full 
capabilities of the temporal approaches, evaluation on a more challenging 
large-scale data set is required. 

In Fig. 6.11, the best F1-score depending on o is plotted. The results in- 
dicate that the temporal smoothing is slightly less sensitive to the choice of 
the image similarity weighting parameter a, compared to the MF-VI Data 
Term approach. The temporal regularization might be able to compensate 
for the suboptimal single frame detections. 


6.4.2.2 Qualitative Results 


Fig. 6.12 and Fig. 6.13 illustrate exemplary comparative mean-field opti- 
mization results with and without using the temporal context. In Fig. 6.12 


an exemplary multi-view frame leveraging the temporal regularization is 
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ations at one time step from three sensors (multi-view frame) 


(a) Input depth observ. 


> gu ar y 


] p 


(b) No temporal context: MF-VI Data Term 


Wu S 


(c) With temporal context: MF-VI Temporal Smoothing 


v 


m, -ia ES 
(d) With temporal context: MF-VI Bayesian Filtering 
Figure 6.12 Exemplary mean-field optimization results depicted for one multi-view frame 


(a). (b,c,d) show the resulting marginal probability map projected onto the ground plane. 
False negatives are marked with a red dot. 
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(c) With temporal context: MF-VI Bayesian Filtering 


Figure 6.13 Mean-field results for five consecutive frames, projected into sensor view one. 


shown. Without the temporal context (Fig. 6.12(b)), the estimated marginal 
probability distribution contains high uncertainty near the two targets due 
to partial visibility and measurement noise, leading to a false negative 
detection in both cases. Exploiting the temporal context can resolve those 
uncertainties, leading to a marginal distribution with clean peaks for both, 
the MF-VI Temporal Smoothing (Fig. 6.12(c)) and the MF-VI Bayesian 
Filtering approach (Fig. 6.12(d)). 

Similar effects can be observed in Fig. 6.13, where results for an exem- 
plary sequence of five consecutive frames (shown only in sensor view 
one) are depicted. Without exploiting the temporal context (Fig. 6.13(a)), 
detection of the person at the bottom right-hand side is very unstable over 
time. In contrast, by applying the temporal smoothing these predictions 
are getting stabilized as evidenced by Fig. 6.13(b). In this sequence, the 
MF-VI Bayesian Filtering approach does not perform quite as well as the 
MF-VI Temporal Smoothing method. Studying the first two frames, the 
filtering approach does not improve the detection results compared to the 
MF-VI Data Term (Fig. 6.13(a)), while the temporal smoothing clearly does. 
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However, once there is some significant probability mass predicted by the 
data term (as observed in Fig. 6.13(a), frame three), the prediction is stable 
in the upcoming frames four and five as evidenced in Fig. 6.13(c). This 
reveals a general difference between the temporal smoothing and the tem- 
poral filtering approach. While in the temporal smoothing approach, the 
marginal distributions are approximated jointly over space and time, in the 
filtering approach the past is condensed in the approximated distribution 
from the previous time step. Due to the joint optimization, the tempo- 
ral smoothing approach effectively incorporates temporal information in 
order to fill in the gap of noisy predictions. 

We observed that the run time per frame decreases slightly on average 
by using the temporal context. This can be explained by the fact that the 
proposed dynamics model effectively restricts the set of grid cells where 
a person can be present with a probability greater than zero, thus less 
mean-field updates need to be evaluated. On a single CPU core”, our 
non-optimized Python implementation needs approximately 700 ms per 
multi-view frame. 


6.5 CNN Inference 


In this section we evaluate the end-to-end multi-view CNN approach 
introduced in Sect. 5.3. We investigate whether the proposed black box 
inference method, trained with synthetic data only, achieves comparable 
results to the more involved mean-field variational inference methods. In 
comparison to the probabilistic approaches, the proposed CNN architec- 
ture is trained with images obtained by the modified” generative scene 
model (cf. Sect. 4.2.1). Thereby, the same discrete ground plane grid is 
used to encode the prediction results. Since the proposed CNN operates 
on a multi-view frame without considering the temporal context, we com- 


$ Intel Core-i7 with 2.9 GHz 


As described in Sect. 5.3.2, the depth image output of the generative scene model is 
randomized. 
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(a) Best F1-score vs epochs (b) Training loss 
Figure 6.14 Exemplary training process of the multi-view CNN architecture with 150.000 
synthetic multi-view frames, sampled from the randomized scene model. (a) shows the best 


F1-score achieved after each epoch on the MULTIPLE data set as well as on a subset of the 
synthetic training data. (b) shows the corresponding training loss on a logarithmic scale. 


pare it to the single frame detection method MF-VI Data Term. For the 
experiments we use the same sequence of 2200 frames as in the evaluation 
of the MF-VI Data Term (Sect. 6.4.1). Since we observed only a slight drop 
in performance when the input depth images are downscaled by a factor 
of 2, we used subsampled depth images with a resolution of 188 x 120 
pixel as input. This has the advantage that the GPU memory footprint can 
be reduced significantly, leading to a faster training process. The results in 
this section are based on the experiments previously published in [112] by 
the author. 

To train the model, we use 150.000 synthetic multi-view frames (con- 
sisting of three depth images each). For the randomized scene model 
we sample each frame accordingly to Algorithm 3. For the plain scene 
model we use the same scene configuration sampling strategy, but omit 
the manipulation of the vertices. We train the model with a batch size 
of 96 and use the Adam [56] optimizer with a learning rate of 0.001. An 
exemplary training process is illustrated in Fig. 6.14. 
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Figure 6.15  Precision-Recall curves showing the performance of the multi-view CNN 
approach with and without domain randomization. 


6.5.1 Quantitative Results 


Fig. 6.15 shows the precision-recall performance of the evaluated ap- 
proaches. While the mean-field variational inference approach (MF-VI 
Data Term) outperforms the other methods, the proposed end-to-end CNN 
method trained with synthetic randomized depth images (CNN (random- 
ized scene model)) achieves noticeable results with best F1-score of 0.88. 
To put these results into perspective, one has to consider that the CNN 
architecture is a black box inference method only trained with synthetic 
training data and does not incorporate any further scene knowledge or 
explicit modeling. Also, the MF-VI Data Term includes an iterative op- 
timization method at inference time. In contrast, the multi-view CNN 
approach is computationally evolved during training, however inference 
is just a forward pass. Still, the performance of the CNN approach is not 
competing with the more involved mean-field inference. Comparing the 
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two manifestations of the proposed architecture CNN (plain scene model) 
and CNN (randomized scene model), shows that randomizing the scene 
model has a significant impact on the performance (F1-score of 0.76 for 
plain person model vs. F1-score of 0.88 for randomized person model). 
This is a noticeable result, since the randomization of the person model 
as proposed in Sect. 5.3.2 does not include any specific scene knowledge. 
Still, it leads to a better generalization of the trained model by narrowing 
the gap between synthetic images and real-world observations, resulting 


in a significant increase in detection performance. 
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6.5.2 Qualitative Results 


Figure 6.16 Exemplary inference results for the multi-view CNN trained on randomized 
synthetic depth images (CNN (randomized scene model)). 


In Fig. 6.16 qualitative results of the proposed CNN architecture trained 
with randomized synthetic depth images are illustrated for two exem- 
plary multi-view frames. Fig. 6.16(a) shows an exemplary convincing 
example with clear peaks at grid locations occupied by a person. In con- 
trast, Fig. 6.16(b) shows a particular faulty example. The probability mass 
is spread around the grid cells occupied by an individual, however the 
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(b) CNN (randomized scene model) 


Figure 6.17 Results for a sequence of four consecutive multi-view frames (for visualization 
purpose only sensor view one is depicted). (a) shows the marginal probabilities obtained 
by mean-field variational inference. (b) depicts the marginal probabilities predicted by the 
multi-view CNN approach. 


marginal probabilities are all quite low. In consequence, none of them 
clearly predicts the presence of a person, leading to false negative detec- 
tions. Comparable false negative cases are the dominant error cases in the 
present data set. 

Fig. 6.17 shows the direct comparison of four consecutive frames, pro- 
jected into sensor view one. This example shows a significant weakness 
of the proposed CNN end-to-end inference. The results of mean-field 
variational inference (without using the temporal context) are quite ro- 
bust over time as evidenced by Fig. 6.17(a). In contrast, the predictions 
of the CNN architecture are very unstable over the given sequence of 
consecutive frames. Even the appearance of individuals hardly change 
in between two consecutive frames, the predicted marginal probabilities 
significantly differ. For example, the grid cell occupied by the person at the 
top right-hand side exhibits a high probability in the first frame. However, 
in the second frame the probability drops significantly, leading to a false 
negative detection. 
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6.6 MAP Inference in Continuous Latent Space 


In this section we report results for the MAP inference method introduced 
in Sect. 5.1.1. A fair comparison of the MAP inference method with the 
methods evaluated in the previous section is not possible due to the fol- 
lowing reasons: (i) in order to formulate the MAP objective, the number of 
people in the scene has to be known a priori; (ii) the underlying non-linear 
least squares optimization problem relies on a good initialization. As a 
consequence, the MAP inference is not suitable as a stand-a-lone method 
in real-world scenarios, whereas it can be seen as a complementary method 
to the proposed discrete inference methods, enabling fine-tuned discrete 
person localization. To illustrate the capabilities of the MAP inference 
approach, we present qualitative results in this section. In Sect. 6.6.1 we 
report results obtained by frame-by-frame inference, while in Sect. 6.6.2 we 
apply the MAP inference method to a sequence of consecutive multi-view 


frames. 


6.6.1 Optimization for One Time Step 


In this section, we show MAP optimization results omitting the temporal 
dynamics as defined in the objective (5.9). The box regularization term is 
weighted with Ayox = 0.1 and the pairwise distance regularization term 
with Ag; = 1 respectively. The number of people present in the scene 
is explicitly given as the ground truth for each multi-view frame. If not 
reported otherwise, the initial scene configuration X = (Xj,...,X,,) is 
drawn randomly. It is assumed that each person location x; € R? is drawn 
independently and distributed uniformly in a rectangular initialization 
area. By narrowing the initialization area to a reduced edge length of 70% 
of the rectangular observable area, initial person locations close to the im- 
age border are avoided. Fig. 6.18 shows a flawless MAP inference example 
for one frame of the MULTIPLE data set. Notice that the individual at the 


10 The result is cherry-picked from 10 optimization results, obtained by different randomly 
drawn initial scene configurations. 
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(c) 


Figure 6.18 Optimization result for the MAP objective without temporal dynamics. (a) 
shows the depth observations with the synthetic depth images corresponding to the initial 
scene configuration as green overlay. (b) shows the synthetic depth images corresponding 
to the final MAP result as blue overlay. The initial positions are given as green dots, the 
optimization trace is drawn in purple. (c) illustrates the final optimization result projected 
into the camera view of each sensor. 


bottom right of sensor view one is not considered to be part of the ground 
truth since itis only hardly visible in a single sensor view. In Fig. 6.18(a) 
one can see that the initial person models in the synthetic depth images are 
already partially overlapping with the depth observations of the individu- 
als present in the scene. In this case, for each location X; some gradients 
point in the direction of an individual, which increases the probability of 
converging to a satisfying local optimum (cf. 5.1.1). 
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Figure 6.19 MAP optimization results for one multi-view frame and five different initial 
scene configurations (a)-(e). Per scene configuration initialization from left to right: (left) 
depth observation of sensor one with initial scene model as green overlay ; (middle) final 
MAP state of scene model as blue overlay and optimization trace in purple; (right) final MAP 
results projected into the camera view one. 
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To emphasize the reliance on a good initialization for the proposed con- 
tinuous optimization method Fig. 6.19 illustrates results for five randomly 
drawn initial scene configurations. For all independent optimization runs 
Fig. 6.19 (a)-(e), the given observations can only be partially explained by 
the model due to a suboptimal initialization. Fig. 6.19(a) shows a particular 
ill-posed initialization. Four of the five person locations are initialized 
close to each other at the bottom of the observable area. After optimization 
one of those person locations successfully explains the individual at the 
bottom left in the scene. However, the remaining person locations are 
finally located at the border of the scene. This can be explained as follows: 
If a rendering of an individual does not share any overlap with the image 
evidence of an individual, the data term costs will only be reduced if the 
synthetic person model moves out of the field of view. During optimiza- 
tion affected person locations move towards the image border until the 
box prior creates an opposite effect. This behavior can be observed in all 
given samples in Fig. 6.19 to some extent. 

In Fig. 6.20 the combination of the MAP inference method with the mean- 
field variational inference approach introduced in Sect. 5.2 is illustrated. 
Fig. 6.20(a) shows the MF-VI result for a single multi-view frame. Applying 
a threshold to the marginal probabilities yields a list of detections, used 
to initialize the continuous optimization problem. By solving for the 
MAP state, the initial discrete grid based person locations get fine-tuned, 
leading to more precise locations on the ground plane. In Fig. 6.20(b) the 
3D person models corresponding to the MF-VI initialization do already 
fit the observations quite well. However, studying the overlay image 
of the final MAP results in Fig. 6.20(c) reveals that the model fit can be 
further improved. While these improvements are limited due to the chosen 
discrete grid with a horizontal and vertical distance of 33cm between 
adjacent grid points (cf. Sect. 6.4), the fine-tuning can be a crucial part 
for applications demanding a precise localization in image coordinates. 
Another potential application is to run the MF-VI detector only for every 
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(d) 


Figure 6.20 Exemplary results of MAP inference initialized with the mean-field detection 
results given in (a). (b) shows the depth observations with the synthetic depth images 
corresponding to the initial scene configuration as green overlay. (c) shows the synthetic 
depth images corresponding to the final MAP result as blue overlay, whereas the initial 
positions are given as green dots and the optimization trace is drawn in purple. (d) illustrates 
the final optimization result projected into the camera view of each sensor. 
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k-th frame and apply MAP inference to the gap frames, initialized with 
the last MF-VI result. 


6.6.2 Sequence Optimization 


The results in the previous section are based on frame-by-frame infer- 
ence. In contrast, in this section we apply the full MAP objective given 
in (5.8) to a sequence of consecutive multi-view frames of the MULTIPLE 
data set. The additional temporal term is weighted with Atemporal = 0.01. 
To initialize the sequence of scene configurations X 1.7, we first draw 
one time independent random scene configuration X = (x... x) 
from a uniform distribution as described for the single frame inference in 
Sect. 6.6.1. Based on the initial scene configuration X", each initial person 
location X; , fori € {l,...,m} and t € (1,..., T}, is independently drawn 
from the normal distribution N ee e^) , with o = 0.7 m. This ensures 
that the initial scene configuration is not ill-posed and huge jumps of an in- 
dividual from one time step to the other are avoided. At the same time the 
proposed initialization exhibits enough randomness to explore the scene 
configuration space, in order to increase the probability of successfully con- 
verging to an optimal fit. Our experiments showed that by initializing each 
X; , independently, the temporal regularization costs will dominate the 
data term costs, eventually leading to an ill-posed optimization problem. 
The major strength of the additional temporal regularization is that the 
optimization becomes more independent of a particular initialization. This 
effect can be observed in Fig. 6.21, where the optimization process for six 
consecutive multi-view frames is illustrated. Applying the proposed joint 
optimization to this sequence leads to a satisfying fit for all time steps. In 
contrast, solving for the scene configuration for a single time step (while 
using the same initialization) is not successful for t = {3,4,5}. This is 
exemplarily evidenced for t = 4 in Fig. 6.22. The MAP scene configuration 
is successfully determined by considering this multi-view frame as part of 
a sequence of frames, as shown in Fig. 6.22(a). In contrast, using the same 
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Sensor 1 


Sensor 3 
x~ 


5 


Sensor 2 


Figure 6.21 MAP optimization result for a sequence of six consecutive multi-view frames. 
Each row corresponds to one time step. For a explanation of the overlay see description 
Fig. 6.18. 
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(c) Initialization 


Figure 6.22 Comparison of optimization results for frame at t = 4 (cf. Fig. 6.21), with (a) 
and without temporal context (b). Both optimizations are initialized with the same scene 
configuration, visualized in (c) as green overlay. 
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initialization (illustrated in Fig. 6.22(c)) but omitting the temporal context, 
the optimization process gets stuck in a local minimum far away from the 


true scene state, see Fig. 6.22(b). 


6.7 Discussion 


The results reported in Sect. 6.4.1 demonstrate that even without using the 
temporal context our proposed mean-field variational inference approach 
achieves strong detection performance, outperforming state-of-the-art 
monocular multi-view people detection methods. Considering that POM 
operates only on binary input images, it is remarkable that POM achieves 
such competitive results. On the one hand this shows the strength of the 
general idea of generative probabilistic modeling in combination with the 
mean-field approximation for multi-view people detection. On the other 
hand our hypothesis is that the difference in performance between POM 
and our approach would be more significant in more complex real-life 
scenarios, where the foreground images include additional objects and 
a higher level of noise. Besides the direct comparison to the literature, 
the results also indicate that our joint optimization approach effectively 
leverages the given multi-view information compared to a single-view 
approach, as evidenced by Fig. 6.4. This claim is also supported by the 
reported performance of the DoG-Detector, which operates independently 
on the foreground segmented depth observations of each sensor. 

The advantages of our generative joint probabilistic approach are also 
supported by the direct comparison with the proposed end-to-end multi- 
view CNN architecture. While the reported results are noticeable, consid- 
ering that the neural network is only trained with synthetic images, the 
overall performance of the proposed CNN is not competitive. This is an 
interesting result, since both approaches are based on the same generative 
scene model (except the additional randomization) and do not use any 
further supervision (e. g. labeled real-world image data). Thus, the CNN 
approach can be interpreted as black box inference method, replacing the 
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more involved, hand crafted mean-field variational inference optimization. 
We emphasize that while the former is a discriminative method the latter 
is a generative method. In view of the reported results, we can conclude 
that for multi-view people detection in overlapping depth images it is 
worth to take the extra effort to deduce the mean-field equations. In spite 
of the reported results, we believe that by leveraging a large scale training 
data set, it might be possible to achieve competitive results with a compa- 
rable CNN architecture. However, a suitable large-scale data set covering 
multi-view people detection in top-view depth images does not exist yet. 

Even though the mean-field approximation applied to a single frame al- 
ready achieves quite strong performance, the results reported in Sect. 6.4.2 
indicate that leveraging the temporal context can further increase the 
detection performance. On the test sequence the best performance was 
achieved by the temporal smoothing approach, directly followed by the 
Bayesian filtering approach. Considering the rather moderate quantitative 
performance improvements on an evaluation sequence of limited length, 
the results have to be interpreted carefully. Taking also the qualitative 
results into account, we can observe several sub-sequences where, com- 
pared to the single frame method, the additional temporal information 
clearly improves the prediction quality. Our hypothesis is that we could 
observe these effects in a higher frequency on a more challenging data set, 
including a more cluttered foreground segmentation. 

From a practical point of view, the proposed mean-field Bayesian fil- 
tering method might be the most promising approach. The temporal 
smoothing approach achieves the best performance by jointly optimizing 
over a sequence of temporal views. This has many theoretically appeal- 
ing aspects and might also be sufficient for some applications. However, 
for many real-world applications batch processing” is often impractical. 
Therefore, we favor the approximation of the Bayesian filtering distribu- 
tion for all applications with real-time requirements. Due to the recursive 


" In contrast to frame-by-frame processing, batch processing refers to a detection method 
taking a sequence of multi-view frames as input. 
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definition of the posterior distribution, the method fulfills the require- 
ments of a frame-by-frame detector while almost reaching the detection 
performance of the temporal smoothing approach. Nonetheless, due to a 
frame processing time in the order of one second, we have to admit that 
the current implementation of our approach is not yet real-time capable. 

Additionally, we report results for the MAP inference method, based 
on a continuous latent space. Compared to the exhaustive evaluation of 
the different manifestation of mean-field approximations, the explanatory 
power of the MAP inference evaluation is limited. The qualitative results 
show that for a sufficient initial scene configuration the continuous opti- 
mization eventually converges, resulting in a good approximation of the 
MAP state. However, the continuous optimization highly depends on an 
adequate initialization. Hence, we see the potential for the MAP inference 
as a post processing step, to fine-tune rather coarse detections. In practice 
this can be useful for offline applications with high-precision requirements, 
e. g. automatic annotation of novel data sets, where initial coarse detections 
could be given by a human annotator or a detection algorithm such as the 
proposed mean-field variational inference approach. Since this is a purely 
offline application, with no real-time requirements, itis also a particular 
example where the MF-VI temporal smoothing approach would be the 
favorable choice over the Bayesian filtering method. 
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7.1 Conclusion 


In the present thesis we have addressed the problem of probabilistic multi- 
view people detection in overlapping depth images. In particular, we have 
investigated methods making joint use of the temporal multi-view image 
evidence. Therefore, we have presented a comprehensive generative prob- 
abilistic framework, recasting the problem of multi-view people detection 
as an inverse problem. The core of this model is a generative scene model, 
which maps a scene configuration to a synthetic depth observation. The 
generative approach effectively handles the different appearances of peo- 
ple, leading to a view-point agnostic detector. Since the generative scene 
model is a function of the projection matrix of each sensor, our framework 
makes it easy to incorporate a new sensor modality. The only requirement 
is the definition of an adequate sensor forward model. Based on the pro- 
posed generative probabilistic model, we have presented several inference 
strategies. 

For continuous latent space, we have deduced the MAP objective and 
solved the resulting non-linear least squares optimization problem by 
leveraging approximate differentiable rendering. While this approach 
hinges on an adequate initialization, our experiments demonstrate that 
the MAP inference can be used complementary to the presented methods 
utilizing a discrete latent space, serving as a fine-tuning post-processing 
step. 

Instead of just estimating a MAP point estimate, we have further in- 
vestigated mean-field variational inference methods to jointly exploit the 
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multi-view information in order to approximate the probability distri- 
bution of people present in the scene. We have deduced the mean-field 
update equations for the data term and proposed a novel strategy to ef- 
ficiently approximate the final mean-field expectations by leveraging a 
pre-computed visual dictionary. For evaluation, we have introduced the 
novel data set MULTIPLE. The data set is publicly available and in par- 
ticular covers indoor people detection in overlapping depth images from 
the top-view. Our experiments have shown state-of-the-art results on the 
MULTIPLE data set. Even without using the temporal context, we have 
demonstrated that our approach achieves strong detection performance, 
outperforming state-of-the-art monocular multi-view people detection 
methods. We were also able to show that using multi-view image evi- 
dence increases the detection performance significantly compared to a 
single-view. 

Furthermore, we have introduced a novel temporal extension of the 
presented mean-field approach, which leverages the temporal context to 
regularize the stochastic mean-field optimization process. In particular, 
we have proposed a grid based dynamics model to describe the flow of 
probability mass over time, enabling the definition of the joint distribution 
of people present in the scene across space and time. Based on the joint dis- 
tribution, we have deduced the mean-field equations for the full joint dis- 
tribution (temporal smoothing) as well as the recursively defined Bayesian 
filtering distribution. Our results show that for the temporal smoothing 
as well as the Bayesian filtering approach, the introduced temporal reg- 
ularization leads to a more robust estimation of the desired probability 
distributions, and in consequence increases the detection performance. 

For a direct comparison with the proposed probabilistic inference meth- 
ods, we have additionally introduced an end-to-end multi-view CNN 
framework. In contrast to prevalent methods in the literature, the CNN 
architecture is trained only on synthetic depth images, sampled from the 
(randomized) generative scene model. Although we have reported notice- 
able results for the proposed CNN framework, the overall performance 


142 


7.2 Future Work 


is not competitive with the proposed mean-field approximations. These 
results support the claim that, for multi-view people detection in depth 
images without a large-scale labeled data set, it is worth making use of 
more involved joint generative probabilistic modeling methods. 

In summary, we have introduced a probabilistic framework for indoor 
people detection in top-view depth images. The overriding goal has been 
to leverage the temporal multi-view image evidence from all depth sensors 
jointly to resolve occlusion as well as measurement noise. We have demon- 
strated that the proposed mean-field methods effectively approximate the 
joint probability distribution of people present in the scene, leading to 
state-of-the-art detection performance. 


7.2 Future Work 


The methods presented in this thesis open up a variety of future research 
directions. While in this work, we focused on a homogeneous network of 
passive stereo depth sensors, it would be appealing to integrate different 
types of sensor modalities into the sensor network. It would be straight- 
forward to integrate other types of depth sensing devices, such as active 
stereo sensors or time-of-flight cameras. However, as long as an appro- 
priate forward model for the sensor modality can be defined, any kind 
of sensor can be integrated into the presented frame work. Considering 
indoor people detection, this could, for example, include monocular video 
cameras, thermographic cameras or even light barriers. 

Besides the integration of new sensor modalities, the proposed forward 
model for depth sensor could be extended in several ways. A distance and 
viewpoint dependent noise model could prevent false negative detections, 
occasionally occurring on the image borders due to heavy measurement 
noise. Second, a more expressive generative scene model could be em- 
ployed. For the sake of a low dimensional latent space, in this thesis, a 
simple rotationally symmetric 3D person model is used. However, one 
could also extend the latent space and model each individual with a mor- 
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7 Conclusion and Future Work 


phable 3D person model, for example incorporating degrees of freedom 
for height, body size, body shape, and rotation. 

From a methodological perspective, there are several opportunities to 
advance the proposed mean-field optimization. For the approximation of 
the data mean-field update expectation, we assume that only the direct 
neighborhood of a grid location affects the expectation. While this assump- 
tion is valid for the top-view, for a frontal viewpoint, a more sophisticated 
approximation would be preferable. This could, for example, be obtained 
by ray-tracing techniques or by checking for the intersection of the bound- 
ing boxes with each other. However, the complexity of the approximated 
mean-field update expectation increases exponentially with the number 
of neighbor cells considered. Focusing on the run-time issues in real-time 
applications, an open question to research is the (partial) parallelization of 
mean-field updates. The challenge here is to employ parallel mean-field 
updates, while exhibiting convergence properties similar to the proposed 
asynchronous mean-field optimization. 

While our experiments showed that the overall detection performance 
can be improved by using the temporal context, there is a great potential 
to extend this further. For example, the proposed dynamics model consid- 
ers the distribution of individuals to be conditionally independent, once 
the previous scene configuration state is given. This assumption makes 
the problem computationally tractable, but it also significantly limits the 
expressiveness of the dynamics model. Therefore, a future research direc- 
tion could be the investigation of different factorizations of the transition 
distribution. In particular, it would be tempting to employ transition dis- 
tributions with a pairwise factorization, promising a significant increase in 
expressiveness while still being tractable. Moreover, it would be engaging 
to consider more powerful motion models, incorporating the direction 
and speed of individuals. 

Overall, to assess the extensions proposed above, a more extensive 
quantitative evaluation would be desirable. Therefore, a large scale data 
set for multi-view people detection in overlapping depth images is needed. 
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7.2 Future Work 


In future work, the MULTIPLE data set could be extended by various 
real-world scenarios, including customer behavior analysis in retail stores 
or people tracking at an airport. 

A promising, fundamental future research direction will be the method- 
ical combination of generative probabilistic modeling with state-of-the-art 
data-driven CNN architectures. While this idea has already been proposed 
in the literature by [9, 10] in the context of multi-view people detection 
with monocular video sensors, we believe that it is a forward-looking 
idea and in general will become increasingly interesting for the computer 
vision community in the upcoming years. 


145 


Appendix 


A Derivation of Mean-Field Update 
Equations 


A.1 General Mean-Field Equations 


The following derivation is based on [70, p. 736 ff.] and [14, p. 465 
ff.]. To keep the notation simple, we use f f(X)dA as short hand for 
So Sp F(X) dai, da’, and f f(4) dA No; as the integral over all vari- 
ables (2; : i Æ j} respectively. The idea is to derive the update equation 
for a single distribution q, by isolating the dependency in the ELBO £(g). 
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A Derivation of Mean-Field Update Equations 


Assuming the elements (q, : i # j } stay fixed, the ELBO for one single 
distribution can be derived as 


Lg) = (108 P(X,O) — log. (4) yx) (Aa) 


ba n (4,0) - 2 bul Th - (A.1b) 


ol 
(A.1c) 
= f ala D) Mac \logp(& , O) dX Nr da’, (A.1d) 
j#t 
log ĵi (X ,O)=(log ALO) (n 
EE Mac, (Sh cj.) + log q; (a: ) dX Nr; dz; 
JA kzi 


2 a (21) log (A, O) del — i as(a/) log q, (21) dx, + const (A.1e) 


p; (X 
= fæ u) Gs dx; + const. (A.1f) 


f 


qi(zi) 


Up to a constant factor the ELBO £(g,) is equivalent to the negative KL 
divergence, thus we can re-write the objective as 


£(qi) = —KL(qi || Di), (A.2) 
with 


P(X, 0) = 7 exp((logr(¥,O))q(x\2!)): (^3) 


i 
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A.2 POM Mean-Field Equations 


It follows that the optimal solution for the objective ĝ;(x;) = arg max L(q;) 


is given as 


alas) = Z exv(ogp( OY) qr). (A4) 


uv 


A.2 POM Mean-Field Equations 


Inserting the probabilistic model of POM to the mean-field expectation we 
obtain the unconditioned expectation 


(log p(B, X)) qa.) = (log p(b | X)P(X)) (aa, (A.5) 
C 

= Sosa (bos Ac(X))) aa) (A.6) 
c=1 


+ (log px) qa) - 
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A Derivation of Mean-Field Update Equations 


By inserting (A.6) in (3.36), the POM mean-field equations are given as 


exp (log p(b, x |z; = 14642.) 


4; = 1) = (A.7a) 
2 sc10,1) exp ((log p(b, x |x; = 5) ao) 
= i + exp (og ib, x | z; = 0)) Gas) (A.7b) 
-1 
- (logp(b,x |2; = Mage) | 
= h + F le = 0)) say (A.7c) 
G 
= 2 (rom (der Ala; = 0) are) 
e=1 
ne | 2, = ee 
2d 
F Lo pom (5c, Ac(xlz; = Dees) 
D p(x | x; = 0) 
=| EC ral P 
e 
ES ((Spom (Bes Acl*l: = 1) a; 
c=1 


- (onlo A] 
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B Derivation of the General 
Probabilistic Model 


In this appendix we provide the derivations for Sect. 4.2. The derivations 
are based on the joint model defined in (4.7), restated as 


T 
P(X vr Orr) =p(01 | 4)p(41) [vor XPE X). (BY 


t=2 


According to Sect. 4.2, in the following we omit the integration domain, 
where f p(X ,.7) dA ].r is a short hand for fp: [pp(Aı:r) d, ... dX 7 
with D defined as the full domain of Y. The integrals are defined over an 
abstract state space implied by Y. Depending on the concrete manifesta- 
tion of the scene configuration space, the integrals have to be refined or, in 
case of a discrete scene configuration space, replaced by the sum over all 
discrete states. 
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B Derivation of the General Probabilistic Model 


B.1 Posterior Distribution 
For t € {2,...,T}: 


plo1:, € 14) 
p(014) 
z IT. 2 P(On | AA | Ar-ı) 
i p(014) 
= plo, | Aı)p(Aı | X 1)p lO 1:01, Y 142) 
plo; | 034 1)p(014—1) 


WX it | 914) E 


= plo, | A )p(X, | A 1 A 1:41 | 01:41) 


plo, | 0141) 


B.2 Predicted Likelihood 


For t € {2,...,T}: 


(B.2a) 


(B.2b) 


(B.2c) 


(B.2d) 


plos | Ora) = | plor ADAL LAA Ln) A ay 


(B.3a) 


= foto f oaa )P(A,4|014 21) dX, , dA, — (B.3b) 
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p(X iors) 


= f por | P| 0111). 
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(B.3c) 


B.3 Filtering Distribution 


B.3 Filtering Distribution 
For t € {2,...,T}: 
»(X,|014) = [mai | O14) d 14-1 (B.4a) 


plo, | €) f 
m NO A lA Kay O dG B.4b 

plo, | or) PX | X A 110914701) dq ( ) 
AE 


= v | v(x, | 41a) Js | 01:41) dV 14.2 dA, (B.4c) 
PA i| o14-1) 
=v fwd ip(a 01421) da . (B.4d) 
»(«]osa-i) 
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